Abstract

Recent research has highlighted the practical benefits of subjective
interestingness measures, which quantify the novelty or unexpectedness of
a pattern when contrasted with any prior information of the data miner
(Silberschatz and Tuzhilin, 1995; Geng and Hamilton, 2006). A key challenge
here is the formalization of this prior information in a way that
lends itself to the definition of a subjective interestingness measure that
is both meaningful and practical.

In this paper, we outline a general strategy of how this could be 
achieved, before working out the details for a use case that is important 
in its own right. 

Our general strategy is based on considering prior information as con- 
straints on a probabilistic model representing the uncertainty about the 
data. More specifically, we represent the prior information by the max- 
imum entropy (MaxEnt) distribution subject to these constraints. We 
briefly outline various measures that could subsequently be used to con- 
trast patterns with this MaxEnt model, thus quantifying their subjective 
interestingness. 

We demonstrate this strategy for rectangular databases with knowledge
of the row and column sums. This situation has been considered
before using computation-intensive approaches based on swap randomizations,
allowing for the computation of empirical p-values as interestingness
measures (Gionis et al., 2007). We show how the MaxEnt model
can be computed remarkably efficiently in this situation, and how it can
be used for the same purpose as swap randomizations but computationally
more efficiently. More importantly, being an explicitly represented
distribution, the MaxEnt model can additionally be used to define
analytically computable interestingness measures, as we demonstrate for tiles
(Geerts et al., 2004) in binary databases.

Keywords: Maximum Entropy Principle, Subjective Interestingness 
Measures, Prior Information, Rectangular Databases, Swap Randomiza- 
tions 
1 Introduction



1.1 Prior work on subjective interestingness 

Prior information and interestingness of patterns Data mining prac- 
titioners commonly have a partial understanding of the structure of the data 
investigated. The goal of the data mining process is then to discover any addi- 
tional structure or patterns the data may exhibit. Unfortunately, structure that 
is trivially implied by the prior information available is often overwhelming, and 
it is hard to design data mining algorithms that look beyond it. 

For example, it should not be seen as a surprise that items known to be 
frequent in a binary database are jointly part of many transactions, as this is 
what should be expected even under a model of independence. Rather than 
discovering such patterns that are implied by prior information, data mining is 
concerned with discovering departures from this prior information. 

Interestingness measures that take into account prior information in this way
are commonly referred to as subjective interestingness measures, first introduced
as a concept in Silberschatz and Tuzhilin (1995). In contrast with objective
interestingness measures (such as the support of an itemset and the confidence
of an association rule), they do not depend on the data alone but also on the prior
information of the data miner. An excellent overview of subjective and objective
interestingness measures for data mining can be found in Geng and Hamilton
(2006).



To define subjective interestingness measures, the ability to formalize prior 
information is as important as the ability to contrast patterns with this infor- 
mation thus formalized. In this paper, we mainly focus on the first of these 
challenges: the task of designing appropriate models incorporating prior infor- 
mation in data mining contexts. However, we will also outline various possible 
approaches of how such a background model can be used to define subjective 
interestingness measures, and we will demonstrate one in greater detail on a 
practical use case. 

Prior work on subjective interestingness measures Several authors have
already suggested ways to incorporate prior information in the data mining
process for the purpose of defining subjective interestingness measures.

In Silberschatz and Tuzhilin (1995), which introduces the idea of subjective
interestingness measures, and in later work (e.g. Padmanabhan and Tuzhilin,
1998, 2000), prior information is formalized as a set of beliefs, each of which
holds with a certain confidence. The beliefs they consider are of the form of
rules $X \rightarrow Y$ where X and Y are conjunctions of literals. Patterns in the
form $A \rightarrow B$ are then assessed for unexpectedness in a well-defined way with
respect to each belief. A disadvantage of this approach is that it is local in that
each belief is treated independently of the others. Furthermore, it is specifically
designed for patterns in the form of rules.



An approach that overcomes these problems was proposed in Jaroszewicz and
Simovici (2004), still for binary databases. They propose to use a Bayesian network model
for the transactions to formalize background knowledge. They then use this
model to compute the difference between the expected frequency of an itemset 
and its observed frequency in the data as a subjective measure of interestingness. 

Despite its potential, this approach suffers from a few limitations. First, 
it may not always be clear how the Bayesian network needs to be designed to 
accord with the prior information. The approach is particularly impractical for 
data mining practitioners unfamiliar with Bayesian networks. Second, it treats 
transactions as i.i.d. random variables. And third and probably most seriously, 
the variables in the Bayesian network are the items (or attribute values), such
that prior information on individual transactions cannot be taken into account.

An approach that resolves all three of these problems, albeit for particular types
of data and prior information, is presented in Gionis et al. (2007). In this work,
the authors show how one can assess the significance of data mining results
in binary databases with respect to prior information on the row and column
sums. Their methodology relies on swap randomizations, which leave the row
and column sums invariant. By iteratively applying swap randomizations they
show how one can approximately sample from the uniform distribution over
all databases with row and column sums as specified by the prior information.
This can be used by computationally intensive approaches (e.g. Gentle, 2005) for
estimating the significance of data mining results as quantified by the empirical
p-value. Later this work was extended to real-valued data (Ojala et al., 2008)
and to more complex constraints besides row and column sums (Hanhijärvi et al.,
2009).



The statistical assessment of data mining results using the randomization
methods from Gionis et al. (2007), Ojala et al. (2008), Mannila (2008), and Hanhijärvi et al.
(2009) is extremely useful and deserves a central place in data mining practice.
However, it would be even more useful if a model for prior information could 
be used to directly guide data mining algorithms toward the subjectively more 
interesting patterns. Unfortunately, from a practical point of view, the use of 
models that are defined implicitly in terms of invariants seems limited to post- 
hoc analyses. Indeed, it seems hard to scale algorithms that need to explore an 
entire search space of possible patterns if they need to assess each candidate by 
means of a randomization test or by referring to a large number of randomized 
data sets. Thus it is unclear if and how they could be used to define practical 
measures of interestingness other than empirical p-values. A further and at least
as serious disadvantage of randomization methods is that their resolution is lim- 
ited by the inverse of the number of randomized data sets considered. This is a 
problem in the highly relevant region of small p-values where a high resolution 
is important. 

In contrast to this, an explicit analytical model capable of formalizing im- 
portant types of prior information would enable one to assess patterns in an 
analytical way rather than in a computationally intensive way. Interestingness
could then be quantified using exact hypothesis testing as in Gallo et al. (2007,
2009), where a relatively simple independence model for items and transactions
was used as a null model formalizing prior information. Alternatively,
information-theoretic principles could be applied to quantify the information content
of a pattern, as done in Siebes et al. (2006) for defining an objective
interestingness measure, but then with respect to a background model defined by the
prior information. We will argue that the results in this paper will make this 
possible. 

1.2 Contributions in this paper 

In this paper, we present a methodology for efficiently computing explicitly
representable probabilistic models for general types of data, able to incorporate
broad classes of prior information. Our approach is based on the maximum
entropy (MaxEnt) principle (Jaynes, 1982). In Sec. 2 we first sketch the
methodology in its full generality, and we briefly outline various ways in which such
a MaxEnt model could be used to define subjective interestingness measures. 
This general framework is the first contribution in this paper. 

In the second part of the paper we demonstrate this approach for rectangular
databases with constraints on the row and column marginals as prior information,
and for patterns in the form of tiles (Geerts et al., 2004). The purpose of
this second part is twofold. First, this particular use case is important in its own
right, and has received a significant amount of attention in the literature (e.g.
Gionis et al., 2007; Ojala et al., 2008; Hanhijärvi et al., 2009). Second, we hope
that elaborating on this use case may support and clarify the general approach
outlined in Sec. 2, thus underscoring its wider potential for the definition of
subjective interestingness measures also in other situations.

This second part of the paper is structured as follows. In Sec. 3 we derive
the MaxEnt model for rectangular databases under row and column sum constraints,
and show how it can be computed remarkably efficiently. We do this for
binary, positive integer-valued and positive real-valued databases, as it comes at
virtually no extra cost as compared to just dealing with binary data. In Sec. 4
we relate these MaxEnt models to distributions defined implicitly by swap
randomizations. In particular, we prove invariance of these MaxEnt models to a
generalized type of swap randomization. In Sec. 5 we show that it is computationally
cheap to sample randomized databases from the MaxEnt model, such
that it can be used as an efficient alternative to swap randomizations. More
importantly, we show how it can be used to define subjective interestingness of
tile patterns with respect to prior information on the row and column sums of
the database. In Sec. 6 we point out some interesting relations with the literature.
And in Sec. 7 we provide experiments demonstrating the efficiency and scalability
of the approach, as well as experiments to assess the usefulness of the subjective
interestingness measure for tiles.



This paper significantly extends two unpublished technical reports (De Bie,
2009a,b), the first one about the maximum entropy modeling approach, the
second one primarily introducing a subjective interestingness measure of which
the one in the present paper is a refinement.
2 Formalizing prior information, and subjective
interestingness measures: a general approach

In this Section, we introduce the MaxEnt modeling strategy at a general level
(Sec. 2.1), and outline ways in which such a probabilistic representation for
prior information could be used to define subjective interestingness measures
for patterns (Sec. 2.2).

We should stress that in this Section, we have no intention to be overly spe- 
cific or focused on a particular type of data, prior information, or pattern type. 
Instead, our goal is to outline some general principles and ideas, centred around 
the formalization of prior information in a MaxEnt model. To become practical, 
these ideas need to be developed and specified further in additional research, and
we demonstrate this for a particular use case in the later Sections in this paper. 

2.1 The maximum entropy principle to model prior infor- 
mation 

Here we will introduce the maximum entropy principle and its use for modeling 
prior information in full generality. In Sec. 3 we will then apply this to the
special case of rectangular databases and prior information on the row and 
column sums. 

Let $\mathcal{X}$ be any countable set.¹ Consider the problem of finding a probability
distribution $P$ over the data $x \in \mathcal{X}$ that satisfies a set of linear constraints
implied by prior information. In particular, we will consider constraints of the
form:

$$\sum_{x} P(x) f_i(x) = d_i, \qquad (1)$$

where $f_i$ are real-valued functions of the data. Regarding these functions as
properties of the data, these constraints could be the formalization of certain 
'expectations' (in both the formal and informal meaning of the word) of a data 
miner about these properties in the data. As mentioned earlier, we will give a 
specific example in Sec. 3, but for now we will focus on the implications of such
a set of constraints on the shape of the probability distribution. 

In general, these constraints will not be sufficient to uniquely determine 
the distribution of the data. A common strategy to overcome this problem 
is to search for the distribution that has the largest entropy subject to these 
constraints, to which we will refer as the MaxEnt distribution. Mathematically,
it is found as the solution of:

$$\max_{P(x)} \; -\sum_{x} P(x) \log P(x), \qquad (2)$$

$$\text{s.t.} \quad \sum_{x} P(x) f_i(x) = d_i, \quad (\forall i) \qquad (3)$$

$$\sum_{x} P(x) = 1, \qquad (4)$$

where the last constraint ensures that $P(x)$ is properly normalized.

¹ The results below can easily be extended to measurable sets as well, but we present them
for countable sets for notational simplicity.

Originally advocated in Jaynes (1957, 1982) as a generalization of Laplace's
principle of indifference, the choice for the MaxEnt distribution can be defended
in a variety of ways. The most common argument is that any distribution other 
than the MaxEnt distribution effectively makes additional assumptions about 
the data that reduce the entropy. As making additional assumptions biases the 
distribution in undue ways, the MaxEnt distribution is the safest bet. 

A lesser-known, but no less convincing, argument is a game-theoretic one
(Topsøe, 1979). Assuming that the true data distribution satisfies the given
constraints, it remarks that the Shannon-optimal compression code (e.g. Huffman)
designed based on the MaxEnt distribution minimizes the worst-case expected 
coding length of a message coming from the true distribution. Hence, using the 
MaxEnt distribution for designing a code is optimal in a robust minimax sense. 

Besides these motivations for the MaxEnt principle, it is also relatively easy
to compute a MaxEnt model. Indeed, the MaxEnt optimization problem (2)-(4) is
convex, and can be solved using standard techniques from convex optimization
theory (Boyd and Vandenberghe, 2004). Let us use Lagrange multiplier $\mu$ for
constraint (4) and $\lambda_i$ for constraints (3). Using $\lambda$ to denote the vector containing
all Lagrange multipliers $\lambda_i$, the Lagrangian is then equal to:

$$L(\mu, \lambda, P(x)) = -\sum_{x} P(x) \log P(x) + \sum_i \lambda_i \left( \sum_{x} P(x) f_i(x) - d_i \right) + \mu \left( \sum_{x} P(x) - 1 \right). \qquad (5)$$

Equating the derivative w.r.t. $P(x)$ to 0 yields the optimality conditions:

$$\log P(x) = \mu - 1 + \sum_i \lambda_i f_i(x),$$

$$P(x) = \frac{1}{Z} \exp\left( \sum_i \lambda_i f_i(x) \right),$$

where we introduced a new variable $Z = \exp(1 - \mu)$. The normalization
constraint $\sum_x P(x) = 1$ is often imposed constructively by choosing $Z$ to be an
appropriate function of the other Lagrange multipliers $\lambda$, in particular:

$$Z(\lambda) = \sum_{x} \exp\left( \sum_i \lambda_i f_i(x) \right). \qquad (6)$$

The function $Z(\lambda)$ is known as the partition function. This leads to the final
form of the MaxEnt distribution as a function of the Lagrange multipliers $\lambda$:

$$P(x) = \frac{1}{Z(\lambda)} \exp\left( \sum_i \lambda_i f_i(x) \right). \qquad (7)$$



The resulting model is a member of the exponential family of distributions,
such that all existing theory for this family of distributions can be used (e.g.
Wainwright and Jordan, 2008). The optimal values of the Lagrange multipliers
$\lambda$ can be found by minimizing the Lagrange dual objective. This Lagrange dual
is obtained by substituting Eq. (7) for $P(x)$ in the Lagrangian (Eq. (5)). After
some algebra:

$$L(\lambda) = \log(Z(\lambda)) - \sum_i \lambda_i d_i. \qquad (8)$$

Minimizing $L(\lambda)$ thus yields the values for the Lagrange multipliers and thus
the MaxEnt distribution. In passing we note that it is easy to see that $L(\lambda)$ is
equal to the negative log-likelihood of the distribution from Eq. (7) on data $x$ that
satisfies $f_i(x) = d_i$. Hence, the MaxEnt distribution subject to constraints (3)
is equivalent to the maximum likelihood distribution of the form (7) fitted to
data for which the constraints $f_i(x) = d_i$ hold deterministically rather than in
expectation (e.g. Wainwright and Jordan, 2008).



The log-partition function $\log(Z(\lambda))$ is well-known to be a convex function,
such that $L(\lambda)$ is convex as well. Thanks to this, it turns out that minimizing
$L(\lambda)$ can be done efficiently for a broad class of constraints, using standard
techniques for convex optimization (see Sec. 3.4) or using special purpose techniques
such as Iterative Proportional Fitting (e.g. Wainwright and Jordan, 2008).
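To make the dual minimization concrete, the following minimal sketch fits a MaxEnt distribution over a small finite set under a single expectation constraint. It is an illustration under stated assumptions rather than the method used in this paper: the function name `maxent_distribution`, the die example, and the use of bisection on the derivative of the dual are all choices made here for brevity. With one constraint, the derivative of $L(\lambda)$ equals the model expectation of $f$ minus $d$, which is increasing in $\lambda$, so a root can be bracketed.

```python
import math


def maxent_distribution(values, f, d, lo=-50.0, hi=50.0, iters=200):
    """MaxEnt distribution over a finite set under one constraint E[f(x)] = d.

    The dual L(lam) = log Z(lam) - lam * d is convex and its derivative,
    E_lam[f(x)] - d, is increasing in lam, so bisection finds the minimizer.
    The bracket [lo, hi] is an assumption that suits small examples.
    """
    def expectation(lam):
        weights = [math.exp(lam * f(x)) for x in values]
        z = sum(weights)  # partition function Z(lam)
        return sum(w * f(x) for w, x in zip(weights, values)) / z

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if expectation(mid) < d:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    weights = [math.exp(lam * f(x)) for x in values]
    z = sum(weights)
    return lam, [w / z for w in weights]


# Hypothetical example: a six-sided die whose expected face value is 4.5.
faces = [1, 2, 3, 4, 5, 6]
lam, probs = maxent_distribution(faces, lambda x: x, 4.5)
mean = sum(p * x for p, x in zip(probs, faces))
```

The returned probabilities have the exponential-family form of Eq. (7), and `mean` reproduces the constraint value up to numerical precision.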



A full discussion of the efficiency of the optimization of the Lagrange multi- 
pliers in the most general form of the problem is beyond the scope of this paper. 
Let us just point out that many results from the graphical models literature can 
be borrowed to establish tractability results. Rather than staying at the general 
level, we choose to fully explore a specific type of data and prior information of 
particular interest to data mining in Sec. 3 below.



2.2 Using the MaxEnt model to define subjective inter- 
estingness measures 

Given a representation of prior information in the form of the MaxEnt distribution,
the subjective interestingness of a pattern can be quantified by contrasting it with
the MaxEnt distribution. This can be done by computing some measure of
unexpectedness of the pattern, e.g. using hypothesis testing, or by relying on
information theory. Here is a non-exhaustive list of possibilities:

1. Self-information. A first option for quantifying interestingness of a pattern 
relies on the probability of the pattern under the MaxEnt model. The smaller
this probability is, the more surprising the pattern when contrasted with
the prior information. Equivalently, the negative log-probability can be
used, which represents the coding length required to encode the pattern
with a Shannon-optimal code with respect to the MaxEnt model. In
Shannon's information theory, such a negative log-probability is known
as the self-information (e.g. Cover and Thomas, 1991). The larger this
quantity, the more information the pattern contains. Interestingly, in
thermodynamics the self-information is also known as surprisal (Tribus).

2. Information compression ratio. The self-information of a pattern does 
not take into account the complexity of describing or communicating the 
pattern to the data miner. This complexity could be formalized as the 
description length of the pattern in a code that assigns longer code words
to patterns that are perceived as more complex by the data miner. Then,
we can define a subjective interestingness measure as the ratio of the self- 
information of the pattern given the MaxEnt model and the pattern's 
description length. This would correspond to some kind of compression 
ratio: how much information is compressed in the description of a pattern? 

3. P-value. A third option is the probability of the pattern or a stronger
instantiation of the pattern to be present in the data, with respect to
the MaxEnt model as null hypothesis. This probability is known as a
p-value in statistics (e.g. Lehmann and Romano, 1995), and computing
the p-value is at the core of hypothesis testing. With the MaxEnt model
as null hypothesis, patterns with a small p-value are then those that are 
maximally surprising given the prior information embedded in it. Hence, 
in a certain well-defined sense these will be unexpected to the data miner. 

4. P-value based on the likelihood ratio test. To use the previous approach, 
a notion of pattern strength needs to be chosen as a test statistic, such 
as the frequency of an itemset. Perhaps a more principled approach is 
by relying on the ratio of the probability of the data under the MaxEnt 
model, versus the probability of the data under an augmented model that 
is corrected for the presence of the found pattern. The augmented model 
can also be found using MaxEnt, with an additional constraint for the fact 
that the pattern is there. Based on this likelihood ratio, a p-value can be 
computed using a likelihood ratio test, if certain regularity conditions are
satisfied (e.g. Lehmann and Romano, 1995).
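As a small illustration of the first two options in the list above, the sketch below computes the self-information of a pattern from its probability under the background model, and divides it by a description length to obtain a compression ratio. The function names, the choice of bits (base-2 logarithm) as the unit, and the example numbers are assumptions made here; the text does not fix any of them.

```python
import math


def self_information(probability):
    """Self-information, in bits, of a pattern with the given probability
    under the background model; rarer patterns carry more information."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must lie in (0, 1]")
    return -math.log2(probability)


def compression_ratio(probability, description_length_bits):
    """Option 2: self-information divided by the pattern's description length."""
    if description_length_bits <= 0:
        raise ValueError("description length must be positive")
    return self_information(probability) / description_length_bits


# Hypothetical patterns with the same description length of 5 bits.
rare = compression_ratio(1.0 / 1024, 5.0)  # 10 bits of information over 5 bits
common = compression_ratio(0.5, 5.0)       # 1 bit of information over 5 bits
```

Under this measure the rare pattern is ranked as more interesting than the common one, since it conveys more information per bit of description.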



It seems hard to discuss any of these approaches more formally without being
more specific about the particular type of pattern concerned. For this reason, in
Sec. 5.2 we will demonstrate the second option (Information compression ratio),
which we regard as particularly promising, for the particular case of tiles in binary
databases. Working out the details of the other approaches, and the connections
between them, could be the subject of further research on the topic of subjective
interestingness.
3 MaxEnt Distributions for Rectangular Databases



In the rest of the paper, we will elaborate on the details of the outlined general 
approach for specific types of data, prior information, and patterns. In the 
current Section, we will apply the general MaxEnt modeling strategy to the 
important case of rectangular databases. To this end, we will cast the prior 
information in the general form of Eq. (1). We will investigate the specific form
of the resulting MaxEnt model, and show how it can be fitted in a remarkably 
efficient way. 



3.1 Notation 

In the rest of this paper, we will denote the database using the matrix $D$ with
$m$ rows and $n$ columns. To maintain generality, we will assume that all matrix
values belong to some specified set $\mathcal{D} \subseteq \mathbb{R}^+$, i.e. $D(i,j) \in \mathcal{D}$. Later we will
choose the set $\mathcal{D}$ to be the set $\{0,1\}$ (to model binary databases), the set of
positive integers (to model integer-valued databases), or the set of positive reals
(to model real-valued databases). Other choices can be made, and it is fairly
straightforward to adapt the derivations accordingly. For notational simplicity,
in the subsequent derivations we will assume that $\mathcal{D}$ is discrete and countable.
However, if $\mathcal{D}$ is continuous the derivations can be adapted easily.



3.2 Swap randomizations and prior information for databases 



For binary databases, it has been argued that row and column sums can often
be assumed as prior information.² Any pattern that can be explained by
referring to row or column sums in a binary database is then deemed
uninformative. Previous work has introduced ways to assess data mining results based
on this assumption (Gionis et al., 2007; Mannila, 2008; Hanhijärvi et al., 2009).
These methods rely on the ability to sample random databases from the uniform
distribution over all databases that satisfy the prior information. To assess a
frequent itemset in the given database, they then compute the empirical p-value
as a subjective interestingness measure, defined as the fraction of the random
databases in which the itemset is at least as frequent as in the given database.
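The empirical p-value just defined can be sketched in a few lines. The function name `empirical_p_value` and the list of supports are hypothetical; the supports stand in for the frequency of one itemset in ten randomized databases.

```python
def empirical_p_value(observed, randomized):
    """Fraction of randomized databases in which the statistic (e.g. the
    support of an itemset) is at least as large as in the given database."""
    if not randomized:
        raise ValueError("need at least one randomized value")
    hits = sum(1 for value in randomized if value >= observed)
    return hits / len(randomized)


# Hypothetical supports of one itemset in ten randomized databases.
supports = [12, 15, 9, 20, 11, 14, 10, 13, 16, 8]
p = empirical_p_value(16, supports)  # two of the ten values are >= 16
```

With ten randomized databases the result can only be a multiple of 1/10, which illustrates how the resolution of the empirical p-value is limited by the number of randomized data sets.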

Unfortunately, sampling from this uniform distribution cannot be done in
a direct way. To overcome this, the authors randomize the given database
by iteratively applying elementary randomization operations: so-called swap
randomizations that transform any $2 \times 2$ submatrix of the form
$\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ into $\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$.
(See Fig. 1 for a graphical illustration.) Clearly, such operations
leave the row and column sums invariant. Furthermore, Gionis et al. (2007)
showed how the limit distribution of a Markov chain of random swap operations
is equal to the uniform distribution over all databases with the specified row and
column marginals. Hence, one can approximately sample from this distribution
by running this Markov chain for a sufficiently long time (although there are
no theoretical results on convergence rates). The swap operation has later been
generalized to deal with real-valued databases as well (Ojala et al., 2008), and
we will get back to this in Sec. 4.

Figure 1: The effect of a swap operation on a binary database.

² We refer to Gionis et al. (2007) for a detailed argumentation, and to the experiments
in Sec. 7.3 of this paper for a particular use case.
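The swap operation described above can be sketched as follows. This is an illustrative simplification, not the implementation of Gionis et al. (2007): the function name `swap_randomize` is hypothetical, rows and columns are picked uniformly at random, attempts that do not hit a swappable submatrix are simply skipped, and a swappable submatrix is flipped in whichever of the two orientations it occurs.

```python
import random


def swap_randomize(matrix, attempts, seed=0):
    """Apply random swap operations to a copy of a binary matrix.

    Each attempt picks two distinct rows (i, k) and two distinct columns
    (j, l). If the 2x2 submatrix is a 'checkerboard' (equal diagonal,
    equal anti-diagonal, and the two differ), it is flipped, which leaves
    every row sum and every column sum unchanged.
    """
    rng = random.Random(seed)
    d = [row[:] for row in matrix]
    m, n = len(d), len(d[0])
    for _ in range(attempts):
        i, k = rng.sample(range(m), 2)
        j, l = rng.sample(range(n), 2)
        if d[i][j] == d[k][l] and d[i][l] == d[k][j] and d[i][j] != d[i][l]:
            d[i][j], d[i][l] = d[i][l], d[i][j]
            d[k][j], d[k][l] = d[k][l], d[k][j]
    return d


# Hypothetical 3 x 3 binary database.
data = [[1, 0, 1],
        [0, 1, 0],
        [1, 1, 0]]
randomized = swap_randomize(data, attempts=1000)
```

Whatever the number of attempts, `randomized` has the same row and column sums as `data`; how many attempts are needed for the chain to mix well is not addressed by this sketch.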

The models we will develop in this paper are based on exactly these invariants 
of the row and column sums, be it in a somewhat relaxed form: we will assume 
that the expected values of the row and column sums are equal to specified 
values. Mathematically, this can be expressed as: 



$$\sum_{D \in \mathcal{D}^{m \times n}} P(D) \left( \sum_j D(i,j) \right) = d_i^r \quad (\forall i),$$

$$\sum_{D \in \mathcal{D}^{m \times n}} P(D) \left( \sum_i D(i,j) \right) = d_j^c \quad (\forall j),$$

where $d_i^r$ is the i'th expected row sum and $d_j^c$ the j'th expected column sum.
Although they have been suggested for binary databases (Gionis et al., 2007),
and later extended to real-valued databases (Ojala et al., 2008), we will explore
the consequences of these constraints in broader generality, for various choices
for $\mathcal{D}$.

Importantly, it is easy to verify that these constraints are exactly of the type
of Eq. (1), with the row sums $\sum_j D(i,j)$ and the column sums $\sum_i D(i,j)$
playing the role of the functions $f_i$, such that the MaxEnt formalism is directly
applicable.



3.3 MaxEnt matrix distributions with given expected row 
and column sums 

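The MaxEnt distribution over the set of $m \times n$ matrices $D$ subject to constraints
on the expected row and column sums is thus found by solving:

$$\max_{P(D)} \; -\sum_{D} P(D) \log(P(D)),$$

$$\text{s.t.} \quad \sum_{D} P(D) \left( \sum_j D(i,j) \right) = d_i^r, \quad (\forall i) \qquad (9)$$

$$\sum_{D} P(D) \left( \sum_i D(i,j) \right) = d_j^c, \quad (\forall j) \qquad (10)$$

$$\sum_{D} P(D) = 1. \qquad (11)$$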


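As shown in Sec. 2.1, the resulting distribution will belong to the exponential
family, and will be of the form of Eq. (7). Using Lagrange multipliers $\lambda_i^r$ for
constraints (9) and $\lambda_j^c$ for constraints (10), this yields:

$$\begin{aligned}
P(D) &= \frac{1}{Z(\lambda^r, \lambda^c)} \exp\left( \sum_i \lambda_i^r \left( \sum_j D(i,j) \right) + \sum_j \lambda_j^c \left( \sum_i D(i,j) \right) \right) \qquad (12) \\
     &= \frac{1}{Z(\lambda^r, \lambda^c)} \exp\left( \sum_{i,j} D(i,j) (\lambda_i^r + \lambda_j^c) \right) \\
     &= \frac{1}{Z(\lambda^r, \lambda^c)} \prod_{i,j} \exp\left( D(i,j) (\lambda_i^r + \lambda_j^c) \right), \qquad (13)
\end{aligned}$$

where $Z(\lambda^r, \lambda^c)$ is the partition function, $\lambda^r$ is the vector of Lagrange
multipliers $\lambda_i^r$ for constraints (9), and $\lambda^c$ the vector of Lagrange multipliers $\lambda_j^c$ for
constraints (10).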

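Let us have a more detailed look at the partition function. Following Eq. (6),
it is equal to:

$$\begin{aligned}
Z(\lambda^r, \lambda^c) &= \sum_{D} \prod_{i,j} \exp\left( D(i,j) (\lambda_i^r + \lambda_j^c) \right) \\
                        &= \prod_{i,j} \sum_{D(i,j) \in \mathcal{D}} \exp\left( D(i,j) (\lambda_i^r + \lambda_j^c) \right) \\
                        &= \prod_{i,j} Z(\lambda_i^r, \lambda_j^c),
\end{aligned}$$

where $Z(\lambda_i^r, \lambda_j^c) = \sum_{D(i,j) \in \mathcal{D}} \exp\left( D(i,j) (\lambda_i^r + \lambda_j^c) \right)$.
Plugging this into Eq. (13) yields:

$$P(D) = \prod_{i,j} \frac{1}{Z(\lambda_i^r, \lambda_j^c)} \exp\left( D(i,j) (\lambda_i^r + \lambda_j^c) \right).$$

This is a product of exponential family distributions for each of the elements in
the matrix $D$. The partition function $Z(\lambda^r, \lambda^c)$ is the product of the partition
functions $Z(\lambda_i^r, \lambda_j^c)$ of each of these individual distributions. Thus, we have
proved the following Theorem.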

Theorem 1. The MaxEnt distribution for matrices $D \in \mathcal{D}^{m \times n}$ subject to
constraints on the expected row and column sums is of the form:

$$P(D) = \prod_{i,j} P_{ij}(D(i,j)), \qquad (14)$$

where

$$P_{ij}(D(i,j)) = \frac{1}{Z(\lambda_i^r, \lambda_j^c)} \exp\left( D(i,j) (\lambda_i^r + \lambda_j^c) \right) \qquad (15)$$

is a properly normalized probability distribution for the matrix element $D(i,j)$
at row $i$ and column $j$. Hence, the MaxEnt model factorizes as a product of
independent distributions for the matrix elements.

It is important to stress that we did not impose independence at the outset. 
The independence is a consequence of the MaxEnt objective. 
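The factorization of the partition function, which underlies Theorem 1, can be checked numerically on a tiny example. The sketch below is only a sanity check under assumptions made here: the function names are hypothetical, the domain is taken to be a small finite set (binary by default), and the multiplier values are arbitrary.

```python
import itertools
import math


def partition_brute_force(lam_r, lam_c, domain=(0, 1)):
    """Sum of exp(sum_ij D(i,j) * (lam_r[i] + lam_c[j])) over all matrices D
    with entries in a finite domain (enumerated explicitly)."""
    m, n = len(lam_r), len(lam_c)
    total = 0.0
    for entries in itertools.product(domain, repeat=m * n):
        exponent = sum(entries[i * n + j] * (lam_r[i] + lam_c[j])
                       for i in range(m) for j in range(n))
        total += math.exp(exponent)
    return total


def partition_factorized(lam_r, lam_c, domain=(0, 1)):
    """Product over all cells of the per-cell partition function."""
    return math.prod(sum(math.exp(v * (a + b)) for v in domain)
                     for a in lam_r for b in lam_c)


# Arbitrary multiplier values for a 2 x 3 matrix.
lam_r, lam_c = [0.3, -1.2], [0.5, -0.4, 0.1]
z_sum = partition_brute_force(lam_r, lam_c)
z_prod = partition_factorized(lam_r, lam_c)
```

The two values agree up to rounding, while the brute-force sum enumerates $|\mathcal{D}|^{mn}$ matrices and the factorized form needs only $mn$ per-cell sums.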

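Various particular choices for $\mathcal{D}$ will lead to various distributions, with
appropriate values for the normalization constant $Z(\lambda_i^r, \lambda_j^c)$. Let us discuss three
examples in detail.

Bernoulli For $\mathcal{D} = \{0, 1\}$, the partition function

$$Z(\lambda_i^r, \lambda_j^c) = 1 + \exp(\lambda_i^r + \lambda_j^c),$$

such that:

$$P_{ij}(D(i,j)) = \begin{cases}
\dfrac{\exp(\lambda_i^r + \lambda_j^c)}{1 + \exp(\lambda_i^r + \lambda_j^c)} & \text{if } D(i,j) = 1, \\[2ex]
\dfrac{1}{1 + \exp(\lambda_i^r + \lambda_j^c)} & \text{if } D(i,j) = 0.
\end{cases} \qquad (16)$$

This means that for $\mathcal{D} = \{0, 1\}$ the MaxEnt distribution for $D$ reduces to
a product of independent Bernoulli distributions, with probability of success
equal to $p_{ij} = \frac{\exp(\lambda_i^r + \lambda_j^c)}{1 + \exp(\lambda_i^r + \lambda_j^c)}$ for $D(i,j)$.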
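Table 1: Three possible domains $\mathcal{D}$ for the elements of $D$, the corresponding
partition functions $Z(\lambda_i^r, \lambda_j^c)$ in the MaxEnt distribution $P_{ij}(D(i,j))$ for the matrix element
$D(i,j)$, and the resulting type of distribution for the matrix elements.

$\mathcal{D}$      $Z(\lambda_i^r, \lambda_j^c)$                          Distribution   Parameter: Value
$\{0,1\}$          $1 + \exp(\lambda_i^r + \lambda_j^c)$                  Bernoulli      success prob.: $\frac{\exp(\lambda_i^r + \lambda_j^c)}{1 + \exp(\lambda_i^r + \lambda_j^c)}$
$\mathbb{N}$       $\frac{1}{1 - \exp(\lambda_i^r + \lambda_j^c)}$        Geometric      success prob.: $1 - \exp(\lambda_i^r + \lambda_j^c)$
$\mathbb{R}^+$     $-\frac{1}{\lambda_i^r + \lambda_j^c}$                 Exponential    rate param.: $-(\lambda_i^r + \lambda_j^c)$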


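Geometric For $\mathcal{D} = \mathbb{N}$, the partition function

$$Z(\lambda_i^r, \lambda_j^c) = \sum_{k=0}^{\infty} \exp\left( k (\lambda_i^r + \lambda_j^c) \right) = \frac{1}{1 - \exp(\lambda_i^r + \lambda_j^c)},$$

assuming that $\lambda_i^r + \lambda_j^c < 0$ to ensure convergence of the sum. Thus:

$$\begin{aligned}
P_{ij}(D(i,j)) &= \left[ 1 - \exp(\lambda_i^r + \lambda_j^c) \right] \cdot \exp\left( D(i,j) (\lambda_i^r + \lambda_j^c) \right) \\
               &= \left[ 1 - \exp(\lambda_i^r + \lambda_j^c) \right] \cdot \left[ \exp(\lambda_i^r + \lambda_j^c) \right]^{D(i,j)}.
\end{aligned}$$

This means that for $\mathcal{D} = \mathbb{N}$ the MaxEnt distribution for $D$ reduces to a product
of independent geometric distributions, with probability of success equal to
$1 - \exp(\lambda_i^r + \lambda_j^c)$ for the matrix element $D(i,j)$.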

Exponential For $\mathcal{D} = \mathbb{R}^+$, the partition function

$$Z(\lambda_i^r, \lambda_j^c) = \int_0^{\infty} \exp\left( x (\lambda_i^r + \lambda_j^c) \right) dx = -\frac{1}{\lambda_i^r + \lambda_j^c},$$

assuming that $\lambda_i^r + \lambda_j^c < 0$ to ensure convergence of the integral. Thus:

$$P_{ij}(D(i,j)) = -(\lambda_i^r + \lambda_j^c) \cdot \exp\left( D(i,j) (\lambda_i^r + \lambda_j^c) \right).$$

This means that for $\mathcal{D} = \mathbb{R}^+$ the MaxEnt distribution for $D$ reduces to a product
of independent exponential distributions, with rate parameter equal to
$-(\lambda_i^r + \lambda_j^c)$ for the matrix element $D(i,j)$.

These results are summarized in Table 1.
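The three cases can be collected in a short sketch that evaluates the per-element partition function and draws a single matrix element, given the sum $s = \lambda_i^r + \lambda_j^c$. The function names and the string labels `"binary"`, `"integer"` and `"real"` are hypothetical conveniences, and the inverse-transform sampler for the geometric case is one standard choice among several.

```python
import math
import random


def partition_function(domain, s):
    """Per-element partition function for s = lambda_i^r + lambda_j^c."""
    if domain == "binary":
        return 1.0 + math.exp(s)
    if s >= 0:
        raise ValueError("s must be negative for the sum or integral to converge")
    if domain == "integer":
        return 1.0 / (1.0 - math.exp(s))
    if domain == "real":
        return -1.0 / s
    raise ValueError("unknown domain")


def sample_element(domain, s, rng):
    """Draw one matrix element from the corresponding MaxEnt distribution."""
    if domain == "binary":
        # Bernoulli with success probability exp(s) / (1 + exp(s)).
        return 1 if rng.random() < math.exp(s) / (1.0 + math.exp(s)) else 0
    if s >= 0:
        raise ValueError("s must be negative")
    if domain == "integer":
        # Geometric on {0, 1, 2, ...}: P(k) = (1 - exp(s)) * exp(s) ** k.
        u = 1.0 - rng.random()  # uniform on (0, 1]
        return int(math.log(u) / s)
    if domain == "real":
        # Exponential with rate parameter -s.
        return rng.expovariate(-s)
    raise ValueError("unknown domain")


rng = random.Random(1)
draws = [sample_element("integer", math.log(0.5), rng) for _ in range(20000)]
geometric_mean = sum(draws) / len(draws)  # theoretical mean is 1 for exp(s) = 0.5
```

Because the model factorizes over matrix elements, a full randomized database can be drawn by calling `sample_element` once per cell with the appropriate value of `s`.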

3.4 Optimizing the Lagrange multipliers 

We have now derived the shape of the models P(D), expressed in terms of the 
Lagrange multipliers, but we have not yet discussed how to compute the values 
of these Lagrange multipliers at the MaxEnt optimum. 
In Sec. 2.1 we outlined the general strategy to do this: the optimal values for
the Lagrange multipliers are found by minimizing the Lagrange dual as given
by Eq. (8). For concreteness, let us go through the mathematical details for
the case of a rectangular binary matrix: $D \in \{0, 1\}^{m \times n}$. There should be no
conceptual difficulties in adapting the derivations below for other choices of $\mathcal{D}$,
and for conciseness these adaptations are omitted from this paper.

For $D \in \{0, 1\}^{m \times n}$, the Lagrange dual from Eq. (8) is equal to:

$$L(\lambda^r, \lambda^c) = \log(Z(\lambda^r, \lambda^c)) - \sum_i \lambda_i^r d_i^r - \sum_j \lambda_j^c d_j^c.$$

Using $Z(\lambda^r, \lambda^c) = \prod_{i,j} Z(\lambda_i^r, \lambda_j^c)$ and $Z(\lambda_i^r, \lambda_j^c) = 1 + \exp(\lambda_i^r + \lambda_j^c)$, this gives:

$$\begin{aligned}
L(\lambda^r, \lambda^c) &= \sum_{i,j} \log(Z(\lambda_i^r, \lambda_j^c)) - \sum_i \lambda_i^r d_i^r - \sum_j \lambda_j^c d_j^c \\
                        &= \sum_{i,j} \log\left( 1 + \exp(\lambda_i^r + \lambda_j^c) \right) - \sum_i \lambda_i^r d_i^r - \sum_j \lambda_j^c d_j^c.
\end{aligned}$$

The optimal values of the parameters are easily found using standard methods
for unconstrained convex optimization such as Newton's method or (conjugate)
gradient descent, possibly with a preconditioner (Shewchuk, 1994; Boyd and
Vandenberghe, 2004). We will report computational results for two possible choices in Sec. 7.
Gradient descent type methods rely on the gradient of $L$, while Newton's method
relies on the gradient as well as the Hessian. Both can easily be computed
analytically. The gradient is determined by the first order partial derivatives:

$$\frac{\partial L}{\partial \lambda_i^r} = \sum_j \frac{\exp(\lambda_i^r + \lambda_j^c)}{1 + \exp(\lambda_i^r + \lambda_j^c)} - d_i^r,$$

$$\frac{\partial L}{\partial \lambda_j^c} = \sum_i \frac{\exp(\lambda_i^r + \lambda_j^c)}{1 + \exp(\lambda_i^r + \lambda_j^c)} - d_j^c.$$



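Note that these derivatives have a natural interpretation. Indeed, the sum
$\sum_j \frac{\exp(\lambda_i^r + \lambda_j^c)}{1 + \exp(\lambda_i^r + \lambda_j^c)}$ is equal to the expected number of ones in the $i$th row for
the distribution with the current parameter values, and the partial derivative
$\frac{\partial L}{\partial \lambda_i^r}$ is equal to the difference between that expected number and the value $d_i^r$
it needs to be as required by the constraints. The Hessian is determined by the
second order partial derivatives, given by:

$$\frac{\partial^2 L}{\partial \lambda_i^r \partial \lambda_k^r} = 0 \quad \text{if } i \neq k, \qquad
\frac{\partial^2 L}{\partial \lambda_j^c \partial \lambda_l^c} = 0 \quad \text{if } j \neq l,$$

$$\frac{\partial^2 L}{\partial (\lambda_i^r)^2} = \sum_j \frac{\exp(\lambda_i^r + \lambda_j^c)}{\left(1 + \exp(\lambda_i^r + \lambda_j^c)\right)^2},$$

$$\frac{\partial^2 L}{\partial (\lambda_j^c)^2} = \sum_i \frac{\exp(\lambda_i^r + \lambda_j^c)}{\left(1 + \exp(\lambda_i^r + \lambda_j^c)\right)^2},$$

$$\frac{\partial^2 L}{\partial \lambda_i^r \partial \lambda_j^c} = \frac{\exp(\lambda_i^r + \lambda_j^c)}{\left(1 + \exp(\lambda_i^r + \lambda_j^c)\right)^2}.$$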
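To make the optimization concrete, here is a minimal sketch that fits the multipliers for a tiny binary matrix by plain gradient descent on the dual, using only the gradient formulas above. The helper names, the fixed step size, the iteration count, and the toy row and column sums are assumptions chosen for illustration; they are not the solvers used in the paper's experiments.

```python
import math

def sigmoid(x):
    """Numerically stable exp(x) / (1 + exp(x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def gradient(lam_r, lam_c, d_r, d_c):
    """Expected minus required row sums, and likewise for the columns."""
    g_r = [sum(sigmoid(a + b) for b in lam_c) - d for a, d in zip(lam_r, d_r)]
    g_c = [sum(sigmoid(a + b) for a in lam_r) - d for b, d in zip(lam_c, d_c)]
    return g_r, g_c

def fit_maxent(d_r, d_c, step=0.1, iters=5000):
    """Plain gradient descent on the convex dual (illustrative, not tuned)."""
    lam_r = [0.0] * len(d_r)
    lam_c = [0.0] * len(d_c)
    for _ in range(iters):
        g_r, g_c = gradient(lam_r, lam_c, d_r, d_c)
        lam_r = [a - step * g for a, g in zip(lam_r, g_r)]
        lam_c = [b - step * g for b, g in zip(lam_c, g_c)]
    return lam_r, lam_c

# Toy 3 x 4 example: required row sums and column sums (both total 6).
d_r, d_c = [3, 2, 1], [2, 2, 1, 1]
lam_r, lam_c = fit_maxent(d_r, d_c)
g_r, g_c = gradient(lam_r, lam_c, d_r, d_c)  # close to zero at the optimum
```

At convergence the expected row and column sums under the fitted model match the prescribed ones, which is exactly the statement that the gradient vanishes.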

The number of Lagrange multipliers to be optimized over, which is crucial for
the computational cost of e.g. Newton iterations, is equal to $m + n$. While this
is sublinear in the size of the data $mn$, it is still a daunting number for practical
sizes of databases. However, the computational and space complexity can often
be further reduced, in particular when the numbers of distinct values of $d_i^r$ and
of $d_j^c$ are small. Indeed, thanks to symmetry and convexity of $L$, if $d_i^r = d_k^r$ for
specific $i$ and $k$, the corresponding optimal values of the Lagrange multipliers
$\lambda_i^r$ and $\lambda_k^r$ will be equal as well, and the same goes for the elements of $\lambda^c$. In
practice this allows one to drastically reduce the number of free variables, down
to the sum of the number of distinct expected row sums $d_i^r$ and the number of
distinct expected column sums $d_j^c$.

Especially for $\mathcal{D} = \{0, 1\}$, in almost all practical cases this allows for a
massive reduction in computational complexity. The number of distinct row
and column sums can be upper bounded in terms of the dimensions of $\mathbf{D}$, the
number of non-zero elements in $\mathbf{D}$, and the largest row and column sums in $\mathbf{D}$,
as quantified by the following Lemmas.

Lemma 1. In a binary matrix $\mathbf{D} \in \{0,1\}^{m \times n}$, the number of distinct row sums
$\tilde{m}$ is upper bounded by $\min(m, n + 1)$ and the number of distinct column sums
$\tilde{n}$ is upper bounded by $\min(m + 1, n)$.

Proof. Let us prove it for row sums only. Clearly the number of distinct row
sums is bounded by the total number of rows $m$. On the other hand, the
only possible values of the row sums are $0, 1, \ldots, n$, a total of $n + 1$ distinct
values. □

Lemma 2. In a binary matrix $\mathbf{D} \in \{0,1\}^{m \times n}$ with $\sum_{i,j} \mathbf{D}(i,j) = s$ (i.e. with
$s$ ones), the number of distinct row sums $\tilde{m}$ is upper bounded by $\sqrt{2s}$, and the
same holds for the number of distinct column sums $\tilde{n}$.

Proof. Let us prove this bound for the number of rows by contradiction. Assume
that the number of different row sums is larger than $\sqrt{2s}$. This means that the
number of ones in $\mathbf{D}$ is at least $\frac{\sqrt{2s}(\sqrt{2s} + 1)}{2} = \frac{2s + \sqrt{2s}}{2} = s + \sqrt{s/2}$, if the distinct
row sums are $0, 1, \ldots, \sqrt{2s}$ and no row sum is equal to another except those
equal to 0. Since $s + \sqrt{s/2} > s$, the assumption is incorrect and the number of
different row sums cannot be larger than $\sqrt{2s}$. □

Lemma 3. In a binary matrix $\mathbf{D} \in \{0,1\}^{m \times n}$ with largest row and column
sums equal to $d^r_{\max}$ and $d^c_{\max}$ respectively, the number of distinct row sums $\tilde{m}$
and the number of distinct column sums $\tilde{n}$ are upper bounded by $d^r_{\max} + 1$ and
$d^c_{\max} + 1$ respectively.

Proof. This follows directly from the fact that row and column sums are integers
larger than or equal to 0 and at most equal to $d^r_{\max}$ and $d^c_{\max}$ respectively. □

Combining these Lemmas yields the following Theorem.

Theorem 2. In a binary matrix $\mathbf{D} \in \{0,1\}^{m \times n}$ with $\sum_{i,j} \mathbf{D}(i,j) = s$ (i.e.
with $s$ ones), and with largest row and column sums equal to $d^r_{\max}$ and $d^c_{\max}$
respectively, the following inequalities hold for the number of distinct row sums
$\tilde{m}$ and the number of distinct column sums $\tilde{n}$:

$$\tilde{m} \leq \min\left\{m, n + 1, \sqrt{2s}, d^r_{\max} + 1\right\},$$
$$\tilde{n} \leq \min\left\{m + 1, n, \sqrt{2s}, d^c_{\max} + 1\right\}.$$

For $\mathcal{D}$ the set of integers, similar bounds can be obtained.

Let us thus partition the rows in groups of equal value of $d_i^r$, and for the $k$'th
group corresponding to expected row sum $d_k^r$ let us use the Lagrange multiplier
$\lambda_k^r$, with $m_k$ denoting the number of rows in that group. Similarly we can define
$d_l^c$, $\lambda_l^c$ and $n_l$. Then we can express the Lagrange dual as:

$$L(\lambda^r, \lambda^c) = \sum_{k,l} m_k n_l \log\left(1 + \exp(\lambda_k^r + \lambda_l^c)\right) - \sum_k m_k \lambda_k^r d_k^r - \sum_l n_l \lambda_l^c d_l^c.$$

The gradient is easily calculated to be given by:

$$\frac{\partial L}{\partial \lambda_k^r} = \sum_l m_k n_l \frac{\exp(\lambda_k^r + \lambda_l^c)}{1 + \exp(\lambda_k^r + \lambda_l^c)} - m_k d_k^r,$$

$$\frac{\partial L}{\partial \lambda_l^c} = \sum_k m_k n_l \frac{\exp(\lambda_k^r + \lambda_l^c)}{1 + \exp(\lambda_k^r + \lambda_l^c)} - n_l d_l^c,$$

and the Hessian elements are computed in a similar way. 
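The reduction to one multiplier per distinct row sum and per distinct column sum can be sketched as follows. The helper names `group_sums` and `grouped_dual` and the toy sums are hypothetical; the only point is that the grouped dual coincides with the full dual whenever rows (or columns) with equal sums share a multiplier.

```python
import math
from collections import Counter

def group_sums(sums):
    """Distinct values of the sums, and the number of rows (or columns) per value."""
    counts = Counter(sums)
    values = sorted(counts)
    return values, [counts[v] for v in values]

def grouped_dual(lam_r, lam_c, d_r, m_k, d_c, n_l):
    """Lagrange dual with one multiplier per group of equal row (column) sums."""
    total = 0.0
    for a, mk in zip(lam_r, m_k):
        for b, nl in zip(lam_c, n_l):
            total += mk * nl * math.log1p(math.exp(a + b))
    total -= sum(mk * a * d for a, mk, d in zip(lam_r, m_k, d_r))
    total -= sum(nl * b * d for b, nl, d in zip(lam_c, n_l, d_c))
    return total

# Toy sums: four rows but only two distinct row sums, four columns but two distinct column sums.
row_sums, col_sums = [2, 1, 2, 1], [3, 1, 1, 1]
values_r, m_k = group_sums(row_sums)   # two groups of two rows
values_c, n_l = group_sums(col_sums)   # one group of three columns, one of one
```

Calling `grouped_dual` with all group sizes equal to 1 recovers the ungrouped dual, so the same function can be used to compare both formulations.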

The computational complexity of computing the gradient as well as the
Hessian is $O(\tilde{m}\tilde{n})$. Applying Newton's method then requires solving a linear
system of $\tilde{m} + \tilde{n}$ equations, with computational complexity $O((\tilde{m} + \tilde{n})^3)$, although
more efficient approximation methods such as conjugate gradient can be used
(e.g. Boyd and Vandenberghe, 2004). Combining this with Theorem 2 yields
an overall worst-case complexity of at most $O(\sqrt{s^3})$ per iteration for Newton's
method and at most $O(s)$ for first order methods such as gradient descent. The
space complexity for Newton's method is determined by the size of the Hessian,
such that it is bounded by $O(s)$ and thus of the order of the size of the database
in sparse representation. For gradient-based or conjugate gradient methods, it is
bounded by the size of the gradient, $O(\sqrt{s})$. As we will see in the experiments,
these results make the MaxEnt approach amenable to practical problems of
very large scale.

The reader may be left with one concern: the fact that the MaxEnt model
is a product distribution of independent distributions for each $\mathbf{D}(i,j)$ seems
to suggest that parameters need to be stored for each of these $m \times n$ element
distributions. However, it should be pointed out that one does not need to store
the value of $\lambda_i^r + \lambda_j^c$ for each pair of $i$ and $j$. It suffices to store just the $\lambda_i^r$ and
$\lambda_j^c$ to compute the probabilities for any $\mathbf{D}(i,j)$ in constant time. Hence, the
space required to store the resulting model is $O(m + n)$, sublinear in the size of
the data.



4 The Invariance of the MaxEnt Matrix Distribution to $\delta$-Swaps

The MaxEnt models introduced in Sec. 3 are explicitly represented probability
distributions. As a result, they are useful for defining analytically computable
measures of interestingness, as outlined in Sec. 2.2, and we will demonstrate
this by designing a concrete interestingness measure in Sec. 5.2. Still, it is
instructive to point out some relations between our MaxEnt models and the
previously proposed swap randomization approaches and generalizations.



4.1 $\delta$-swaps: a randomization operation on matrices

First, let us generalize the definition of a swap as follows.

Definition 1 ($\delta$-swap). Given an $m \times n$ matrix $\mathbf{D}$, a $\delta$-swap for rows $i, k$ and
columns $j, l$ is the operation that adds a fixed number $\delta$ to $\mathbf{D}(i,j)$ and $\mathbf{D}(k,l)$
and subtracts the same number from $\mathbf{D}(i,l)$ and $\mathbf{D}(k,j)$.

Of course, for a $\delta$-swap to be useful, it must be ensured that $\mathbf{D}(i,j) + \delta$,
$\mathbf{D}(k,j) - \delta$, $\mathbf{D}(i,l) - \delta$, $\mathbf{D}(k,l) + \delta \in \mathcal{D}$. We will refer to such $\delta$-swaps as
allowed $\delta$-swaps.

Definition 2 (Allowed $\delta$-swap). A $\delta$-swap for rows $i, k$ and columns $j, l$ is said
to be allowed for a given matrix $\mathbf{D}$ over the domain $\mathcal{D}$ iff $\mathbf{D}(i,j) + \delta$, $\mathbf{D}(k,j) - \delta$,
$\mathbf{D}(i,l) - \delta$, $\mathbf{D}(k,l) + \delta \in \mathcal{D}$.

Clearly, an allowed $\delta$-swap leaves the row and column sums invariant. The
following Theorem is more interesting.

Theorem 3. The probability of a matrix $\mathbf{D}$ under the MaxEnt distribution
subject to equality constraints on the expected row and column sums is invariant
under allowed $\delta$-swaps applied to $\mathbf{D}$.

Proof. It is easily verified from Eq. (15) that:

$$P_{ij}(\mathbf{D}(i,j)) \cdot P_{il}(\mathbf{D}(i,l)) \cdot P_{kj}(\mathbf{D}(k,j)) \cdot P_{kl}(\mathbf{D}(k,l))
= P_{ij}(\mathbf{D}(i,j) + \delta) \cdot P_{il}(\mathbf{D}(i,l) - \delta) \cdot P_{kj}(\mathbf{D}(k,j) - \delta) \cdot P_{kl}(\mathbf{D}(k,l) + \delta)$$

for any $\delta$, rows $i, k$ and columns $j, l$. □

This means that for any 2x2 submatrix of D, adding a given number to its 
diagonal and subtracting the same number from its off-diagonal elements leaves 
the total probability of the data under the MaxEnt model invariant. 
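The invariance is easy to check numerically for the binary case. The sketch below is illustrative only: the matrix, the multiplier values, and the helper names are assumptions, and the log-probability is written directly from the Bernoulli product form of the model.

```python
import math

def log_prob(D, lam_r, lam_c):
    """Log-probability of a binary matrix under the product-of-Bernoullis model."""
    total = 0.0
    for i, row in enumerate(D):
        for j, x in enumerate(row):
            t = lam_r[i] + lam_c[j]
            total += x * t - math.log1p(math.exp(t))
    return total

def delta_swap(D, i, k, j, l, delta):
    """Add delta to D[i][j] and D[k][l]; subtract it from D[i][l] and D[k][j]."""
    E = [row[:] for row in D]
    E[i][j] += delta
    E[k][l] += delta
    E[i][l] -= delta
    E[k][j] -= delta
    return E

D = [[1, 0, 1],
     [0, 1, 0],
     [1, 1, 0]]
lam_r = [0.2, -0.7, 0.5]   # arbitrary illustrative multipliers
lam_c = [-0.3, 0.1, -1.1]
# Rows 0, 1 and columns 0, 1 form the submatrix [[1, 0], [0, 1]], so delta = -1 is allowed.
E = delta_swap(D, 0, 1, 0, 1, -1)
```

The two matrices have identical row and column sums, and `log_prob` returns the same value for both up to floating-point rounding, for any choice of multipliers.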

More generally, the MaxEnt distribution assigns the same probability to any
two matrices that have the same row and column sums. This can be seen from
the fact that Eq. (12) is independent of $\mathbf{D}$ as soon as the row and column
sums $\sum_j \mathbf{D}(i,j)$ and $\sum_i \mathbf{D}(i,j)$ are given. In statistical terms: the row and
column sums are sufficient statistics of the data $\mathbf{D}$ for the MaxEnt distribution.
We can formalize this in the following Theorem:

Theorem 4. The MaxEnt distribution for a matrix $\mathbf{D}$, conditioned on
constraints on row and column sums of the form

$$\sum_j \mathbf{D}(i,j) = d_i^r, \qquad \sum_i \mathbf{D}(i,j) = d_j^c,$$

denoted as $P\left(\mathbf{D} \,\middle|\, \sum_j \mathbf{D}(i,j) = d_i^r, \sum_i \mathbf{D}(i,j) = d_j^c\right)$, is identical to the uniform
distribution over all databases satisfying these constraints.

This Theorem further clarifies the connection between the uniform
distribution over all matrices with fixed row and column sums, as sampled from
in Gionis et al (2007); Ojala et al (2008) using swap randomizations, and the
MaxEnt distribution.



4.2 Special cases of $\delta$-swaps

The invariants that have been used before in computation intensive approaches
for defining null models for databases are special cases of these more generally
applicable $\delta$-swaps.

For binary databases the condition $\mathbf{D}(i,j) + \delta$, $\mathbf{D}(k,j) - \delta$, $\mathbf{D}(i,l) - \delta$, $\mathbf{D}(k,l) + \delta \in \mathcal{D}$
corresponds to the fact that either $\delta = -1$ and $\mathbf{D}(i,k;j,l) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$, or
$\delta = 1$ and $\mathbf{D}(i,k;j,l) = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$. Then, the $\delta$-swap is identical to a swap in a
binary database. This shows that the MaxEnt distribution of a binary database
is invariant under swaps as defined in Gionis et al (2007). For positive
real-valued databases, the $\delta$-swap operations reduce to the Addition Mask method
in Ojala et al (2008).



5 Using the MaxEnt model: Randomizing Databases,
and Subjective Interestingness of Tiles

In this Section we will describe how the MaxEnt model from Sec. 3 allows one
to take prior information effectively into account in the data mining process, for
concreteness focusing on binary databases. First we show how it can be used
to randomize databases highly efficiently, such that it is a fast alternative to
swap randomizations. Subsequently, we define a new interestingness measure
for tiles in binary databases when contrasted with prior information on the row
and column sums, based on one of the ideas presented in Sec. 2.2.



5.1 Randomizing binary databases 

The ($\delta$-)swap operations discussed in the previous Section, being simple
invariants of the MaxEnt distribution, can be used for randomizing any of the types of
databases discussed in this paper. This being said, it should be reiterated that
the availability of the MaxEnt distribution should make randomizing the data
using $\delta$-swaps unnecessary. Should it be needed to generate randomizations of a
given database, one can instead sample directly from the MaxEnt distribution,
thus avoiding the computational cost and potential convergence problems faced
in randomizing the data. The thus randomized databases can be used exactly
as proposed in Gionis et al (2007) for the assessment of data mining results.

A randomized binary database can be sampled directly from the MaxEnt
model by looping through all database entries and sampling a Bernoulli
random variable with success probability $P_{ij}(\mathbf{D}(i,j) = 1)$. The complexity of this
approach is $O(mn)$, which is prohibitive for large sparse databases.

Fortunately, a faster approach exists, based on the observation that the 
number of experiments between two successes in a series of Bernoulli experi- 
ments is geometrically distributed. We can sample a sparse representation of a 
large number of Bernoulli random variables by sampling these so-called inter- 
arrival-times from the geometric distribution. In this way, the time required 
is proportional to the number of successes in the set of Bernoulli experiments, 
rather than to the total number of Bernoulli experiments. 

This approach can only be used if all Bernoulli random variables have the
same success probability, which is not true for the success probabilities
$P_{ij}(\mathbf{D}(i,j) = 1)$ of the entries under the MaxEnt model. However, two database entries will
have a different success probability only if either their column sums or their
row sums (and thus the associated Lagrange multipliers) are different. From
Lemma 2 it is immediate that the number of different combinations of row and
column sums is $\tilde{m}\tilde{n} \leq (\sqrt{2s})^2 = 2s$, i.e. at most proportional to the number of
non-zeros in the original matrix.

Putting things together, this means that a random database can be sampled
from the MaxEnt model using a double for-loop over all distinct $\lambda_i^r$ and all
distinct $\lambda_j^c$, with in total at most $2s$ combinations. For each of these combinations,
all entries in intersections of the rows and columns with these Lagrange
multipliers can be sampled efficiently using the geometric distribution as outlined
above.

The total complexity thus consists of two components: sampling the geo- 
metric distribution, and the overhead of looping over all combinations of row 
and column Lagrange multipliers. The latter clearly has a complexity bounded 
by s. The former has a complexity proportional to the number of non-zeros 
in the sampled matrix, which is proportional to s in expectation and tightly 
concentrated around it. Hence, the expected complexity is O(s) with the actual 
complexity tightly concentrated around this. 
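The sampling procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the grouping of rows and columns into lists, and the use of Python's `random.Random` are all choices made here for the example, not the implementation used in the paper.

```python
import math
import random

def sample_success_positions(num_trials, p, rng):
    """Indices of the successes among num_trials i.i.d. Bernoulli(p) trials.

    Runs in time proportional to the number of successes by sampling
    geometric inter-arrival times instead of every individual trial.
    """
    if p <= 0.0:
        return []
    if p >= 1.0:
        return list(range(num_trials))
    log_q = math.log1p(-p)  # log(1 - p) < 0
    positions = []
    pos = -1
    while True:
        u = 1.0 - rng.random()                  # uniform on (0, 1]
        pos += int(math.log(u) / log_q) + 1     # failures before next success, plus one
        if pos >= num_trials:
            return positions
        positions.append(pos)

def sample_maxent_matrix(row_groups, col_groups, lam_r, lam_c, seed=0):
    """Sparse sample: the set of (i, j) entries equal to 1.

    row_groups[k] lists the rows sharing multiplier lam_r[k]; likewise for columns.
    """
    rng = random.Random(seed)
    ones = set()
    for rows, a in zip(row_groups, lam_r):
        for cols, b in zip(col_groups, lam_c):
            p = 1.0 / (1.0 + math.exp(-(a + b)))  # same success probability in this block
            block = len(rows) * len(cols)
            for pos in sample_success_positions(block, p, rng):
                ones.add((rows[pos // len(cols)], cols[pos % len(cols)]))
    return ones
```

The double loop runs over groups of rows and columns rather than over individual entries, so the work per block is governed by the number of ones drawn in it.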

We should point out that sampling from the MaxEnt model cannot be used 
as a substitute for swapping if the row and column sums need to be preserved 
exactly rather than in expectation. This may be the case for categorical data 
represented by a binary matrix where each column corresponds to an attribute- 
value. However, in that case a MaxEnt model can be fitted on the categorical 
representation of the data. Then the constraints will not be on the row and 
column marginals, but on the number of times each of the attribute values is seen 
for each of the attributes. Without going into detail, the MaxEnt distribution 
would then be a product of categorical distributions (one for each database
entry), rather than a product of Bernoulli distributions.



5.2 The MaxEnt model to define interestingness of tiles 

The above shows that the MaxEnt model can be used as an alternative to 
swap randomizations for the generation of randomized versions of databases. A 
comparison with swap randomizations for this purpose of randomizing databases 
can therefore be made, and the empirical results reported in Sec. 7 show that
the MaxEnt model allows one to generate randomizations more efficiently than 
using the swap randomizations strategy. 

However, what is more important is that the explicit analytical nature of 
the MaxEnt model allows one to use it in situations where swap randomiza- 
tions would be impractical, such as for defining new and subjective measures of 
interestingness of patterns. 

To demonstrate the use of the MaxEnt model for this purpose, we here
work out a specific example. In particular, we will focus on binary databases
$\mathbf{D} \in \{0,1\}^{m \times n}$ and a kind of pattern known as a tile (Geerts et al, 2004), that is
denoted as $\tau$ and defined as an ordered pair of a set of rows $I \subseteq \{1, \ldots, m\}$ and
a set of columns $J \subseteq \{1, \ldots, n\}$, i.e. $\tau = (I, J)$. We say that a tile $\tau = (I, J)$
is present in the database $\mathbf{D}$, denoted as $\tau \in \mathbf{D}$, iff $\mathbf{D}(i,j) = 1$ for all $i \in I$
and $j \in J$. Furthermore, we say that the database entry at row $i$ and column
$j$ is contained in a tile $\tau = (I, J)$ iff $i \in I$ and $j \in J$, and we denote this more
concisely as $(i,j) \in \tau$.

Below we will define a measure of interestingness for tiles and extend these
ideas also to sets of tiles. Our approach is based on the second option in Sec. 2.2:
computing the compression ratio of information embodied by the statement that
a tile is present in the database. In order to quantify this, we need to quantify
two things: the self-information of a tile pattern with respect to the MaxEnt
model representing the prior information, and its description length representing
its complexity as perceived by the data miner.



5.2.1 The self-information of a tile 

Let us try to intuitively quantify the amount of information conveyed to a data
miner if he is told about the presence of a tile in the database. We argue that it
could be formalized by the prior belief the data miner had about the presence
or absence of the tile. The most natural way of formalizing this is to use a
background distribution representing the data miner's prior expectations, and
to compute the probability $\Pr(\tau \in \mathbf{D})$ of the tile-pattern under this distribution.
The smaller $\Pr(\tau \in \mathbf{D})$, the more information this tile-pattern contains.

A more convenient way of quantifying this is as the negative log-probability,
known as the self-information in Shannon's information theory (Cover and Thomas,
1991):

$$\text{SelfInformation}(\tau) = -\log\left(\Pr(\tau \in \mathbf{D})\right), \tag{17}$$

where the probability is taken with respect to the background distribution. If 
a pattern is more interesting as its probability is smaller, it is equivalently 
more interesting as its self-information is larger, since minus the logarithm is a 
monotonically decreasing function. 

The self-information is the number of bits (if a base 2 logarithm is used) that 
is required to encode a particular outcome of a random variable in a Shannon- 
optimal code. Here that random variable is the indicator variable indicating 
presence or absence of the tile in the database. Besides its useful interpretation 
as a code length, the self-information has an important practical advantage 
over the probability of the presence of the tile as a measure of information 
content: the logarithm maps extremely small probabilities to numerically more 
manageable values. 

For the MaxEnt model subject to row and column sums, the self-information
of a tile pattern can be computed very conveniently by relying on Eqs. (14)-(16):

$$\text{SelfInformation}(\tau) = -\sum_{(i,j) \in \tau} \log(p_{ij}), \tag{18}$$

where

$$p_{ij} = \frac{\exp(\lambda_i^r + \lambda_j^c)}{1 + \exp(\lambda_i^r + \lambda_j^c)}. \tag{19}$$

I.e., the self-information is equal to the sum of the negative log-probabilities
that the database entry $\mathbf{D}(i,j) = 1$, summed over all row-column pairs in
the tile. The fact that it reduces to a simple sum is due to the independence of
the database entries $\mathbf{D}(i,j)$ under the MaxEnt distribution.
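A minimal sketch of this computation, in bits, is given below. The function names and the convention that a tile is passed as two index collections are assumptions made for illustration; the multipliers are taken as already fitted.

```python
import math

def neg_log2_p(lam_r_i, lam_c_j):
    """-log2 of p_ij = exp(t) / (1 + exp(t)) with t = lam_r_i + lam_c_j."""
    t = lam_r_i + lam_c_j
    # -log(exp(t) / (1 + exp(t))) = log(1 + exp(-t))
    return math.log1p(math.exp(-t)) / math.log(2)

def self_information(tile_rows, tile_cols, lam_r, lam_c):
    """Sum of -log2(p_ij) over all entries (i, j) in the tile (I, J)."""
    return sum(neg_log2_p(lam_r[i], lam_c[j])
               for i in tile_rows for j in tile_cols)
```

For example, with all multipliers equal to zero every entry has probability 1/2, so a tile with 2 rows and 3 columns carries 6 bits of self-information.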



5.2.2 The description length of a tile 

A data miner is never merely interested in receiving as much information as 
possible. Indeed, the best way to achieve this would be to communicate the 
entire database to the data miner, which would be of little use. Instead, a data 
miner will be interested in hearing about patterns that convey new information 
as concisely as possible. 

To quantify this, we also need to consider the inherent description length
of the pattern, in this case the tile $\tau = (I, J)$, or equivalently its set of rows
$I$ and set of columns $J$. A sensible way to describe a set of rows $I$ would be
to assume a probabilistic model in which rows occur independently in $I$, each
with a certain probability $p$ (see below for more details). Then the description
length for the set $I$ under a Shannon-optimal code with respect to this model
is given by:

$$\text{DescriptionLength}(I) = -\sum_{i \notin I} \log(1 - p) - \sum_{i \in I} \log(p)
= -(m - |I|) \log(1 - p) - |I| \log(p)
= |I| \log\left(\frac{1 - p}{p}\right) + m \log\left(\frac{1}{1 - p}\right).$$

Doing the same for the set of columns $J$ and combining both descriptions, the
description length for a tile $\tau = (I, J)$ is given by:

$$\text{DescriptionLength}(\tau) = \text{DescriptionLength}(I) + \text{DescriptionLength}(J)
= (|I| + |J|) \log\left(\frac{1 - p}{p}\right) + (m + n) \log\left(\frac{1}{1 - p}\right). \tag{20}$$

This means that the description length of a tile is equal to its circumference
$|I| + |J|$ times a constant, plus another constant term, which makes intuitive
sense as a model for the perceived complexity of a tile.

The probability parameter $p$ can be set by the data miner to bias the search
toward larger or toward smaller tiles. Indeed, if $p$ is small, the constant
component $(m + n) \log\left(\frac{1}{1 - p}\right)$ of the description length is small while the
variable component $(|I| + |J|) \log\left(\frac{1 - p}{p}\right)$ is large, thus yielding a short description
length for tiles with a small circumference as compared to large ones. In our
experiments, we set it equal to the probability that $\mathbf{D}(i,j) = 1$ for $i$ and $j$
sampled uniformly at random, which is equal to the density of the database
$p = \frac{1}{mn} \sum_{i,j} \mathbf{D}(i,j)$.

5.2.3 The compression ratio of a tile as interestingness measure

The interestingness of a tile can now be quantified as the ratio with which
information is compressed in the tile pattern:

$$\text{CompressionRatio}(\tau) = \frac{\text{SelfInformation}(\tau)}{\text{DescriptionLength}(\tau)}. \tag{21}$$

This ratio expresses the number of bits of information received by the data
miner (with respect to the MaxEnt model), per bit received to describe the
tile $\tau$. Tiles that have the largest compression ratio are thus most efficient at
communicating aspects of the data the data miner did not expect a priori.
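The description length of Eq. (20) and the compression ratio of Eq. (21) translate directly into code. The sketch below works in bits and assumes $0 < p < 1$; the function names and the example values of $m$, $n$ and $p$ are illustrative assumptions.

```python
import math

def description_length(num_rows, num_cols, m, n, p):
    """(|I| + |J|) log2((1 - p) / p) + (m + n) log2(1 / (1 - p)), in bits."""
    return ((num_rows + num_cols) * math.log2((1.0 - p) / p)
            + (m + n) * math.log2(1.0 / (1.0 - p)))

def compression_ratio(self_info_bits, num_rows, num_cols, m, n, p):
    """Bits of self-information conveyed per bit needed to describe the tile."""
    return self_info_bits / description_length(num_rows, num_cols, m, n, p)

# Hypothetical 100 x 100 database with density p = 0.1 and a tile of 2 rows and 3 columns.
dl = description_length(2, 3, 100, 100, 0.1)
```

Because the second term does not depend on the tile, two tiles in the same database differ in description length only through their circumference $|I| + |J|$.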

5.2.4 Finding interesting sets of tiles 

It is well-known that the set of individually most interesting patterns is often
not the most interesting set of patterns, regardless of which interestingness
measure is used (see e.g. De Raedt and Zimmermann (2007)). This is due to
redundancies between the patterns that are individually interesting. So the
question arises if we can also use the above tools to define the interestingness
of a set of tiles $T = \{\tau_1, \ldots, \tau_N\}$. It turns out the approach generalizes easily.
Furthermore, it yields an additional formal argument in favour of using the ratio
of the self-information and description length as interestingness measure for an
individual tile.

Describing a set of tiles requires one to describe each of the tiles in the
set. Hence, the description length of a set of tiles is quantified by the sum of
the description lengths of each of the tiles individually. Slightly overloading
notation, for a set of tiles $T = \{\tau_1, \ldots, \tau_N\}$:

$$\text{DescriptionLength}(T) = \sum_{i=1:N} \text{DescriptionLength}(\tau_i).$$

The self-information of a set of tiles is generalized as the negative
log-probability that all tiles in the set are present in the database. Due to the
independence of the database entries under the MaxEnt distribution, this is
equal to the sum of the negative log-probabilities that $\mathbf{D}(i,j) = 1$ for all the
database entries belonging to some tile $\tau \in T$. Formally:

$$\text{SelfInformation}(T) = -\sum_{(i,j) : \exists \tau \in T \text{ with } (i,j) \in \tau} \log(p_{ij}),$$

with $p_{ij}$ as defined in Eq. (19).

In practice, we argue that a data miner has a bounded capacity of taking
in and processing patterns. Given this capacity, the data miner would like to
receive as much information as possible. In this setting, tile set mining can be
formalized by the following optimization problem:

$$\max_T \ \text{SelfInformation}(T), \qquad \text{s.t. } \text{DescriptionLength}(T) \leq u,$$

for some upper bound $u$, representing the data miner's capacity.

Interestingly, this problem can be reduced to the (weighted) budgeted
maximum coverage problem (Khuller et al, 1999), which is a weighted variant of
the maximum set coverage problem. In that problem, a universe of elements is
given and with each element a weight is associated. Furthermore, a collection
of subsets of the universe is given, each of which has a specified cost. The task
is to select a set of subsets from the collection so as to maximize the sum of the
weights of the elements in the union of these selected subsets, while respecting
an upper bound on the sum of the costs of the selected subsets.

To reduce our tile mining problem to the budgeted maximum coverage
problem, the elements in the universe are the database entries that are equal to 1.
The collection of subsets is given by the collection of all tiles present in the
database. The weight of the database entry at position $(i,j)$ is equal to the
contribution it makes to the information content of a tile containing it, equal to
$-\log(p_{ij})$. And the cost of a tile is equal to its description length.

The budgeted maximum coverage problem is a hard combinatorial problem.
Fortunately, it can be approximated well by using an efficient greedy algorithm
(see Khuller et al (1999) for details). The criterion to greedily select a tile for
inclusion as $k$'th tile in the set is the ratio of the sum of the weights $-\log(p_{ij})$
of database entries $(i,j) \in \tau_k$ not yet contained in earlier selected tiles, versus
the description length of $\tau_k$. Formally, with

$$\text{SelfInformation}^+(\tau_k) = -\sum_{(i,j) \in \tau_k \text{ and } (i,j) \notin \tau_l : l < k} \log(p_{ij}),$$

iteration $k$ of the greedy algorithm selects the tile $\tau_k$ maximizing
$\text{CompressionRatio}^+(\tau_k)$, defined as follows:

$$\text{CompressionRatio}^+(\tau_k) = \frac{\text{SelfInformation}^+(\tau_k)}{\text{DescriptionLength}(\tau_k)}.$$

In the first iteration, this selection criterion coincides with the interestingness
measure $\text{CompressionRatio}(\tau)$ as defined earlier, thus corroborating our choice
of this ratio as interestingness measure.

Note that upon selection of a tile in iteration $k$, the $\text{CompressionRatio}^+(\tau)$
of any other yet unselected tile can only decrease. This can be exploited by the
algorithm by keeping all yet unselected tiles in a sorted list, sorted according to
their last updated value of $\text{SelfInformation}^+(\tau)$. Then, to select the next best
tile, the tile at the top of this list is considered and the updated value of its
$\text{CompressionRatio}^+(\tau)$ is computed. If this value is still larger than that of the
subsequent tile in the list, we can be sure it will remain so even after all subsequent
ones are updated too, and it can be selected as the $(k+1)$'st tile. Otherwise, it
must be inserted in the list to keep it sorted, and the second tile in the list is
considered. This powerful idea was first introduced in Minoux (1978).



The approximation quality of this greedy algorithm is such that any set
of $k$ top-ranked tiles in this list has a self-information that is at least $1 - \frac{1}{e}$
times the maximum self-information that can be achieved by a set of tiles with
a description length that is not longer. Note that this means that the upper
bound $u$ on the total description length of the set of tiles does not need to be
specified in advance. A data miner can keep querying for the next tile in the
list until satisfied, and be sure that all tiles seen so far constitute a tile set that
conveys near to the maximum amount of information given its total description
length.
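The greedy selection with lazy re-evaluation can be sketched with a priority queue. This is an illustrative sketch only: tiles are assumed to be given as sets of `(i, j)` cells, `weights` holds the values $-\log(p_{ij})$, `costs` holds the description lengths, and the queue is keyed on the last computed ratio of marginal weight to cost, which is one way to realize the sorted list described above.

```python
import heapq

def lazy_greedy(tiles, weights, costs):
    """Return tile indices in greedy order of marginal weight per unit cost.

    tiles[t]   : set of cells (i, j) covered by tile t
    weights[c] : weight -log(p_ij) of cell c
    costs[t]   : description length of tile t (must be positive)
    """
    covered = set()
    # Max-heap via negated ratios; stored ratios are upper bounds on current ones.
    heap = [(-sum(weights[c] for c in cells) / costs[t], t)
            for t, cells in enumerate(tiles)]
    heapq.heapify(heap)
    order = []
    while heap:
        _, t = heapq.heappop(heap)
        gain = sum(weights[c] for c in tiles[t] - covered)
        ratio = gain / costs[t]
        if not heap or ratio >= -heap[0][0]:
            # Still at least as good as every other (possibly stale) bound: select it.
            if gain > 0:
                order.append(t)
                covered |= tiles[t]
        else:
            # Ratio dropped below another candidate's bound: reinsert and retry.
            heapq.heappush(heap, (-ratio, t))
    return order

# Toy instance with four cells and four candidate tiles (all values hypothetical).
weights = {(0, 0): 3.0, (0, 1): 2.0, (1, 1): 1.0, (2, 2): 4.0}
tiles = [{(0, 0), (0, 1)}, {(0, 1), (1, 1)}, {(2, 2)}, {(0, 0), (0, 1), (1, 1)}]
costs = [2.0, 1.0, 4.0, 4.0]
order = lazy_greedy(tiles, weights, costs)
```

Tiles whose cells are all covered by earlier selections contribute no new information and are dropped; a caller can stop consuming `order` as soon as the accumulated cost reaches the budget $u$.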



6 Discussion 



In this Section we point out how the MaxEnt modeling strategy from Sec. 3
can be used almost directly for modeling network adjacency matrices, and we
discuss some relations with the data mining and random networks literature.



6.1 Network adjacency matrices as a special case of rectangular databases

Networks can be represented using their adjacency matrix. A swap operation
applied to this adjacency matrix corresponds to swapping a pair of edges $i \to j$
and $k \to l$, yielding a new pair of edges $i \to l$ and $k \to j$. Such edge
swap operations preserve the in- and out-degree of all nodes. They have been
introduced and used for the statistical assessment of network patterns, similar
to the use of swap randomizations in databases (Milo et al, 2002).

All theory developed in this paper for rectangular databases can be applied
with minor changes to various types of networks. They can be unweighted or
weighted, directed or undirected (using a symmetry constraint), with or
without self-loops (by constraining the diagonal to contain zeros only).

We will not go into further details here. However, because of the importance
of networks, in Sec. 7 we will report some empirical results on networks as well.



6.2 Related literature 

In Sec. 1.1 we discussed prior work on subjective interestingness measures.
Here we wish to highlight some further connections with the literature.



6.2.1 Literature related to the MaxEnt model 

It is instructive to point out how some existing models are related to particular 
cases of the MaxEnt models introduced in this paper. 

Most importantly, the MaxEnt model for binary matrices introduced in
this paper is formally identical to the Rasch model, known from psychometrics
(Rasch, 1961). This model was introduced to model the performance of
individuals (rows) on questions (columns) in a questionnaire. The matrix
elements indicate which questions were answered correctly or incorrectly for each
individual. The Lagrange multipliers are interpreted as persons' abilities for
the row variables $\lambda_i^r$, and questions' difficulties for the column variables $\lambda_j^c$.
Somewhat remarkably, the model was not derived from the MaxEnt principle but stated directly.

A similar connection exists with the so-called $p^*$ models from social network
analysis (Robins et al, 2007). Although motivated differently, the $p_1$ model in
particular is formally identical to our MaxEnt model when applied to adjacency
matrices for unweighted networks.

Thus, the present paper provides an additional way to look at these widely 
used models from psychometrics and social network analysis. Furthermore, as 
we have shown, the MaxEnt approach suggests generalizations, in particular 
towards non-binary databases and weighted networks. 

When applied to adjacency matrices of networks, the MaxEnt model is re- 
lated toj^iidoinnfitwork models for networks with prescribed degree sequences 
(see iNewmanl (|2003l) and references therein ). The most similar model to the 
ones discussed in this paper is the one from lChung and Lul f|2004h . In this pa- 
per, the authors propose to assume that edge occurrences are independent, with 
each edge probability proportional to the product of the degrees of the pair of 
nodes considered. In the notation of the present paper: 



P(D)=l[P(D(i,j)) with P(D(i,j)) = 



didj 



where s — di. Also for this model the constraints on the expected row and 
column sums are satisfied. 

It would be too easy to simply dismiss this model by stating that among
all distributions satisfying the expected row and column sum constraints, it is
not the maximal entropy one, such that it is biased in some sense. However,
this drawback can be made more tangible: the model represents a probability
distribution only if $\max_{i,j} d_i d_j < s$, which is by no means true in all practical
applications, in particular in power-law graphs. This shortcoming is a symptom
of a bias of this model: it disproportionately favors connections between pairs
of nodes both of high degree, such that for nodes of too high degrees the edge
'probability' suggested becomes larger than 1. A brief remark considering a
similar model for binary databases was made in Gionis et al. (2007), where it
was dismissed by the authors on similar grounds.
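The breakdown described above is easy to reproduce numerically. The following is a minimal sketch, assuming NumPy and a made-up heavy-tailed degree sequence (the function name and the degrees are illustrative, not taken from Chung and Lu (2004)). It computes $d_i d_j / s$ for all pairs and shows that the expected row sums are matched while the largest entry exceeds 1.

```python
import numpy as np

def chung_lu_probabilities(degrees):
    """Edge 'probabilities' d_i * d_j / s, with s the sum of all degrees."""
    d = np.asarray(degrees, dtype=float)
    return np.outer(d, d) / d.sum()

# Hypothetical heavy-tailed degree sequence: one hub and many low-degree nodes.
degrees = [30, 12, 3, 2, 2, 1, 1, 1, 1, 1]
p = chung_lu_probabilities(degrees)

# The expected row sums equal the prescribed degrees ...
print(np.allclose(p.sum(axis=1), degrees))
# ... but the largest entry is not a valid probability.
print(p.max() > 1.0)
```

Here $s = 54$ while $d_1 d_2 = 360$, so the condition $\max_{i,j} d_i d_j < s$ is violated by a wide margin.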

The uses of the maximum entropy principle in statistics are too numerous
to list (Jaynes (1957, 1982) are good starting points). Of particular interest
to this paper is the prior use of the maximum entropy objective as a
regularizer in image reconstruction, and more specifically in computer tomography
(e.g. Gull and Skilling, 1984). Here, the intensity distribution over the image is
regarded as a probability distribution, to be inferred from integrals of this
distribution along various paths. This is similar to the MaxEnt model for binary
databases presented in the current paper, where the paths correspond to the
rows and columns.

The maximum entropy principle has also been used before in data mining,
albeit for a different purpose and in a different manner than in this paper
(e.g. not for the incorporation of prior information into a background model).
For example, in Tatti (2008) the frequency of an itemset was contrasted with an
estimate based on the frequency of all its subsets, estimated using the maximum
entropy principle. In that paper an itemset was considered more interesting
when the actual and estimated frequency differed more strongly, thus defining
an objective interestingness measure. In Savinov (2004) a similar maximum
entropy model had already been used to come up with upper and lower bounds
on the possible support values for itemsets. In Pavlov et al. (2003), the maximum
entropy principle was used for query count approximation on binary databases.
Here, the number of results of a query was estimated by relying on probabilistic
models of the rows (transactions) in the database. One of the probabilistic
models considered in that paper was the maximum entropy model subject to
the knowledge of the frequency of the frequent itemsets. For computational
reasons, the maximum entropy model here was computed at query time, for
just the small subset of variables involved in the query.



6.2.2 Literature related to the compression ratio as interestingness 
measure 

The self-information described in Sec. 5.2 is most strongly related to the surface
of a tile (Geerts et al., 2004). Indeed, if the expected row sums are all equal
and similarly for the column sums, Pr(D(i,j) = 1) under the MaxEnt model
is constant throughout the database and the self-information of a tile is simply
proportional to its size |I| x |J|. In Geerts et al. (2004), each tile was attributed
an equal cost as well, and the problem of finding an interesting set of tiles was
formalized as finding the set of tiles of a given maximal size that maximizes
the number of database entries covered. Then it was observed that solving this
optimization problem can be approximated using an (unweighted) budgeted
maximum coverage problem.
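The proportionality claim can be spelled out in one line. The sketch below assumes that the self-information of a tile $(I, J)$ is the negative log-probability of all its entries being 1 under the MaxEnt model, whose entries are independent, and it writes $p$ for the constant value of $\Pr(D(i,j) = 1)$; both the symbol $p$ and this reading of the definition from Sec. 5.2 are assumptions made here for illustration.

```latex
\begin{align*}
\mathrm{SelfInformation}(I, J)
  &= -\log \prod_{i \in I} \prod_{j \in J} \Pr(D(i,j) = 1) \\
  &= -\sum_{i \in I} \sum_{j \in J} \log p \\
  &= |I| \, |J| \, (-\log p).
\end{align*}
```

Under these assumptions the constant factor $-\log p$ is shared by all tiles, so ranking tiles by self-information coincides with ranking them by surface.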

Hence, our method can be regarded as a refinement of tiling databases in 
two ways: by giving each database entry a different weight (related to the 
MaxEnt distribution), and by giving each tile a different cost (depending on its 
description length). These two modifications make a dramatic difference in the
subjective quality of the result, as demonstrated in Sec. 7.

Another method that is somewhat related is KRIMP (Siebes et al., 2006),
which searches for a set of itemsets that allow one to compress a database. This 
approach is motivated by the minimum description length principle, regarding 
data mining essentially as a compression process: a pattern set is considered 
(objectively) interesting if it has a short description length and simultaneously 
allows one to describe the data concisely. This leads to objective interestingness 
measures for patterns. In our approach, we are also searching for a pattern set 
with a small description length. However, we are not concerned with describing 
the entire data as concisely as possible, but instead we want to concisely describe 
unanticipated aspects of the data using the pattern set. This makes the resulting 
interestingness measure subjective in nature. 

Lastly, we should note that in a recent conference paper, which however did
not discuss the MaxEnt model in detail, we introduced a similar measure of
interestingness for noisy tiles (Kontonasios and De Bie, 2010), which have also
been discussed by other authors in different contexts, e.g. in Gionis et al. (2004)
and Miettinen et al. (2008). However, in the current paper we chose to present
an interestingness measure for noise-free tiles, as it is sufficient to illustrate the
general framework from Sec. 5 without risking overloading the reader with the
technicalities related to noisy tiles.



7 Experiments 

In this section we assess the computational cost of fitting the MaxEnt model
from Sec. 3 for given expected row and column sums. We also assess the cost
and empirical properties of sampling random databases from the MaxEnt model, 
and we empirically compare this to swap randomizations. Additionally, we will 
show that the MaxEnt model can be fitted efficiently to very large networks as 
well. Finally, we will assess the compression ratio as a subjective interestingness 
measure for tiles by using it for three document databases. 

All experiments were done on a 2GHz Pentium Centrino with 1GB memory,
and the code used for each of the experiments will be made freely available on
www.tijldebie.net/software/maxent.



7.1 The MaxEnt model for binary databases 

We report empirical results on ten databases: seven databases commonly used
for evaluation purposes (Retail (Brijs et al., 1999), Mushroom, Pumsb, Pumsb
star, Connect (Asuncion and Newman, 2007), T10I4D100K, and T40I10D100K
(Agrawal and Srikant, 1994)), as well as the following three textual datasets
turned into databases by considering words as items and documents as
transactions:

ICDM. All ICDM paper abstracts until 2007. Each abstract is represented by 
a transaction and words are items, after stop word removal and stemming. 

KDD. All KDD paper abstracts between 2001 and 2008 (from all sessions) 
downloaded from the ACM website. Each abstract is represented by a 
transaction and words are items, after stop word removal and stemming. 

Pubmed. All Pubmed abstracts retrieved by querying with the search query 
"data mining" , after stop word removal and stemming. 

Some statistics are gathered in Table 2. The table also mentions support
thresholds used in some of the experiments reported below, as well as the numbers of
closed itemsets satisfying these support thresholds.

For each of these databases we computed the MaxEnt model with expected 
row and column sums equal to the observed row and column sums in the data. 



Table 2: Some statistics for the databases investigated: the number of items
(columns) and transactions (rows) in the database, the support threshold used in
the experiments involving closed itemset mining, the resulting number of closed
itemsets, and the average length of each transaction (row) in the databases.

Database      # items   # transactions   Support used    # closed itemsets   Avg. transaction length
ICDM            4,976              859   5 (0.6%)                  365,249   48.9
KDD             6,154              843   5 (0.6%)                2,787,847   65.2
Pubmed         12,661            1,683   10 (0.6%)               1,245,454   74.1
Mushroom          120            8,124   81 (1%)                    78,362   23.0
Retail         16,470           88,162   9 (0.01%)                 191,088   10.3
Pumsb           7,117           49,046   34,332 (70%)              242,001   74.0
Pumsb star      7,117           49,046   14,714 (30%)               16,486   50.5
Connect           130           67,557   40,534 (60%)               68,349   43.0
T10I4D100K      1,000          100,000   100 (0.1%)                 26,962   10.0
T40I10D100K     1,000          100,000   1,000 (1%)                 65,236   40.0

Fitting the MaxEnt model. The method we used to fit the model is a
preconditioned gradient descent method with Jacobi preconditioner (e.g. Shewchuk,
1994), implemented in C++. It is conceivable that more sophisticated methods
will lead to significant further speedups (e.g. methods discussed in Boyd and
Vandenberghe (2004), many of which have guaranteed superlinear convergence),
but this one is empirically fast enough and has the advantage of being
particularly easy to implement.
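To make the fitting step concrete, the following is a minimal NumPy sketch of gradient descent with a Jacobi (diagonal) preconditioner on the Lagrange dual. It is not the C++ implementation used in the experiments. It assumes that the MaxEnt model has the logistic form $\Pr(D(i,j) = 1) = 1 / (1 + \exp(-(\lambda_i + \mu_j)))$ with one multiplier per row and per column, it does not exploit repeated row or column sums, and it adds a backtracking line search for robustness. All function and variable names are illustrative.

```python
import numpy as np

def probabilities(lam, mu):
    """P(D(i,j) = 1) for every cell, assuming a logistic form in lam_i + mu_j."""
    return 1.0 / (1.0 + np.exp(-(lam[:, None] + mu[None, :])))

def dual_objective(lam, mu, r, c):
    """Lagrange dual of the MaxEnt problem under the same assumed form."""
    return np.logaddexp(0.0, lam[:, None] + mu[None, :]).sum() - lam @ r - mu @ c

def fit_maxent(row_sums, col_sums, tol=1e-10, max_iter=1000):
    r = np.asarray(row_sums, dtype=float)
    c = np.asarray(col_sums, dtype=float)
    lam, mu = np.zeros(r.size), np.zeros(c.size)  # all multipliers start at 0
    for _ in range(max_iter):
        p = probabilities(lam, mu)
        g_lam, g_mu = p.sum(axis=1) - r, p.sum(axis=0) - c  # dual gradient
        # Stop on the squared gradient norm, normalized by the number of multipliers.
        if (g_lam @ g_lam + g_mu @ g_mu) / (r.size + c.size) < tol:
            break
        w = p * (1.0 - p)
        # Jacobi preconditioner: divide by the diagonal of the dual Hessian.
        d_lam, d_mu = -g_lam / w.sum(axis=1), -g_mu / w.sum(axis=0)
        slope = g_lam @ d_lam + g_mu @ d_mu
        f0, t = dual_objective(lam, mu, r, c), 1.0
        # Backtracking (Armijo) line search: an addition in this sketch.
        while (dual_objective(lam + t * d_lam, mu + t * d_mu, r, c)
               > f0 + 1e-4 * t * slope) and t > 1e-8:
            t *= 0.5
        lam, mu = lam + t * d_lam, mu + t * d_mu
    return lam, mu, probabilities(lam, mu)

# Toy database (hypothetical): fit to its observed row and column sums.
D = np.array([[1, 0, 1, 0], [0, 1, 1, 0], [1, 1, 0, 1]])
lam, mu, p = fit_maxent(D.sum(axis=1), D.sum(axis=0))
print(np.round(p.sum(axis=1), 3), np.round(p.sum(axis=0), 3))
```

The fitted matrix `p` has expected row and column sums that match those of `D` up to the tolerance, while every entry stays strictly between 0 and 1.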

To illustrate the speed with which the MaxEnt distribution can be computed,
Fig. 2 shows plots of the convergence of the squared norm of the gradient to zero,
for the first 25 iterations. The initial value for all Lagrange multipliers was chosen to
be equal to 0. Noting the logarithmic vertical axis, the convergence appears
exponential. The lower plot in Fig. 2 shows the convergence of the Lagrange
dual objective to its minimum over the iterations, which is reached in
just a few iterations.

In all experiments we stopped the iterations as soon as the normalized
squared norm of the gradient became smaller than $10^{-12}$, which is close to
machine accuracy and which we believe is accurate enough for all practical
purposes. The number of iterations required and the overall computation time are
summarized in Table 3.



Figure 2: Top: the squared norm of the gradient on a logarithmic scale as a
function of the iteration number, plotted for four databases: KDD abstracts,
Mushroom, Pubmed abstracts, and Retail. This plot shows the exponential
decrease of the gradient of the Lagrange dual optimization problem. In the second
plot, the convergence of the Lagrange dual is shown for the same databases.

Table 3: The number of iterations required, and the computation time in seconds
to fit the MaxEnt model.

Database      # iterations   Time (s)
ICDM                    13   0.35
KDD                     13   0.50
Pubmed                  15   1.21
Mushroom                35   0.012
Retail                  18   2.0
Pumsb                   36   0.10
Pumsb star              37   0.94
Connect                 33   0.048
T10I4D100K              34   2.0
T40I10D100K             33   5.3

Figure 3: For the ten datasets under investigation, these plots show the number
of closed itemsets on a logarithmic scale, as a function of their size (solid blue
line). We also show the average number of closed itemsets as a function
of their size found on 100 randomized datasets, along with error bars for the
5th and 95th percentiles. The results are plotted both for the swap
randomization approach (green dotted line) and the MaxEnt sampling approach
(red dashed line).

Assessing data mining results. Here we illustrate the use of the MaxEnt
model for assessing data mining results in the same spirit as Gionis et al. (2007).
Figure 3 plots the number of closed itemsets retrieved on the original data as
a function of the itemset size. Additionally, it shows averages with 5th-95th
percentile error bars for the results obtained on randomly sampled databases
from the MaxEnt model with expected row sums and column sums constrained
to be equal to their values on the original data. If desired, one could extract
one global measure from these results, as in Gionis et al. (2007), and compute
an empirical p-value by comparing that measure obtained on the actual data
with the result on the randomized versions. However, the plots given here do
not force one to make such a choice, and they still give a good idea about which
itemset sizes are significant in the datasets.
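The sampling and p-value computation mentioned above can be sketched in a few lines. This is an illustration under stated assumptions: the cell probabilities `p` stand in for a fitted MaxEnt model, the statistic (largest column sum) is a simple placeholder for an itemset-based measure, and the $+1$ correction in the p-value is a common convention rather than something prescribed in this paper.

```python
import numpy as np

def sample_databases(p, n_samples, rng):
    """Draw binary databases whose cells are independent Bernoulli(p[i, j])."""
    return rng.random((n_samples,) + p.shape) < p

def empirical_p_value(observed, randomized):
    """Fraction of randomized statistics at least as large as the observed one,
    with the common +1 correction (an assumption of this sketch)."""
    randomized = np.asarray(randomized)
    return (1 + np.sum(randomized >= observed)) / (1 + randomized.size)

rng = np.random.default_rng(0)
p = np.full((20, 8), 0.3)   # hypothetical fitted cell probabilities
observed_stat = 12          # hypothetical statistic measured on the real data

samples = sample_databases(p, 100, rng)
# Placeholder statistic: the largest column sum in each sampled database.
stats = samples.sum(axis=1).max(axis=1)
print(empirical_p_value(observed_stat, stats))
```

Because the cells are independent, drawing one database costs a single pass over the probability matrix, and the fitted model is reused for every sample.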

Figure 3 also shows averages and error bars for the results obtained on
databases randomized using swaps, done using the code from Gionis et al. (2007)
and using a number of swaps equal to five times the number of nonzero database
entries (as recommended by Gionis et al. (2007)). It can be noted that the error
bars strongly overlap in most cases. The difference between the two
randomization strategies is largest for the Mushroom dataset. Interestingly, this
is a dataset where the transaction sizes are fixed, such that the MaxEnt modeling
approach may indeed yield qualitatively different results as compared to the swap
randomization approach. Indeed, sampling from the MaxEnt model only conserves
the transaction sizes in expectation. A way around this problem is sketched at
the end of Sec. 5.1.
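For contrast with MaxEnt sampling, the swap randomization idea can be sketched as follows. This is not the code from Gionis et al. (2007). It is a minimal illustration in which each attempt picks two ones at $(i, j)$ and $(k, l)$ and moves them to $(i, l)$ and $(k, j)$ if those cells are empty, which leaves every row and column sum exactly unchanged. Counting attempts rather than successful swaps is an assumption of this sketch.

```python
import numpy as np

def swap_randomize(D, n_attempts, rng):
    """Randomize a binary matrix with swaps that preserve all row and column sums."""
    D = np.array(D, dtype=bool)          # work on a copy
    rows, cols = np.nonzero(D)           # positions of the ones
    for _ in range(n_attempts):
        a, b = rng.integers(len(rows), size=2)
        i, j, k, l = rows[a], cols[a], rows[b], cols[b]
        # A swap is valid only if it creates ones in two currently empty cells.
        if not D[i, l] and not D[k, j]:
            D[i, j] = D[k, l] = False
            D[i, l] = D[k, j] = True
            cols[a], cols[b] = l, j      # entry a now sits at (i, l), entry b at (k, j)
    return D

rng = np.random.default_rng(0)
D = rng.random((30, 12)) < 0.3           # hypothetical binary database
R = swap_randomize(D, 5 * int(D.sum()), rng)
print((R.sum(axis=1) == D.sum(axis=1)).all(), (R.sum(axis=0) == D.sum(axis=0)).all())
```

Unlike a sample from the MaxEnt model, the result `R` matches the margins of `D` exactly rather than in expectation, at the price of a sequential loop over many swap attempts.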



Computational cost compared to swap randomizations. The above shows
that randomizing databases using swap randomizations leads to results similar
to those obtained by sampling from a fitted MaxEnt model. However, the
MaxEnt strategy is five to fifteen times more efficient in generating one
randomized database (including the overhead for fitting the MaxEnt model), and
about thirty times more efficient when several randomized databases need to be
sampled. These computational results are summarized in Fig. 4.



7.2 The MaxEnt model for various types of networks 

Figure 4: Top: The computation times for randomizing the ten databases
considered by fitting the MaxEnt model and sampling from this model. The
computation time is split up in the time required to read the data, compute the
marginals, fit the model, and sample one database from the model. Note that in
order to sample 100 databases, only the last component would need to be done
100 times. Middle: The computation times for randomizing using swaps, split
up into a component for reading the data and a component for generating one
database. Bottom: The ratio of the computation times required by both
methods, considering the total time to generate one randomization, and considering
only the part that needs to be repeated if multiple databases are to be sampled.

Artificially generated power-law networks. To assess the feasibility of
using the MaxEnt modeling strategy for networks, we artificially generated
power-law (weighted) degree distributions for networks of various sizes between n = 10
and $n = 10^6$ nodes, with a power-law exponent of 2.5. I.e., for each number
of nodes n we sampled n expected (weighted) degrees $d_i$ from the power-law
distribution $P(d_i) \propto d_i^{-2.5}$. A power-law degree distribution with this
exponent is often observed in realistic networks (Newman, 2003), so we believe this
is a representative set of examples. For each of these degree distributions, we
fitted four different types of undirected networks: unweighted networks with
and without self-loops, and positive integer-valued weighted networks with and
without self-loops.
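One way to generate such degree sequences is inverse-transform sampling. The sketch below is an illustration, not the procedure used for the experiments: it assumes a continuous power law with minimum value `d_min = 1`, and the rounding to integers is an extra assumption made only to show how few distinct degree values typically occur.

```python
import numpy as np

def sample_power_law_degrees(n, exponent=2.5, d_min=1.0, rng=None):
    """Inverse-transform sampling from a density p(d) ~ d**(-exponent), d >= d_min."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.random(n)                      # uniform on [0, 1)
    return d_min * (1.0 - u) ** (-1.0 / (exponent - 1.0))

raw = sample_power_law_degrees(10_000, rng=np.random.default_rng(0))
degrees = np.rint(raw).astype(np.int64)    # rounding is an assumption of this sketch
values, counts = np.unique(degrees, return_counts=True)
print(len(values), "distinct degrees among", degrees.size, "nodes")
```

The number of distinct values printed here is what determines the number of Lagrange multipliers when nodes with equal degree share a multiplier.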

To fit the MaxEnt models for networks we made use of Newton's method,
which we implemented in MATLAB. As can be seen from Fig. 5, the
computation time was under 30 seconds even for the largest network with $10^6$ nodes.
The number of Newton iterations is less than 50 for all models and degree
distributions considered.

Real-life networks. We also fitted the MaxEnt model to two large real-life
networks:

A symmetrized snapshot of the internet created by Mark Newman in July
2006.³ This is an undirected unweighted network. The number of nodes
in this network is 22,963, the total number of different degrees is 161, with
degrees ranging between 1 and 2390. Fitting the MaxEnt model required
0.73 seconds.

A network of movie actors (Barabási and Albert, 1999), where edges
between actors are weighted by the number of movies in which they jointly
appeared. I.e., this is an undirected integer-valued weighted network. The
number of nodes (actors) in this network is 127,823, the total number of
unique degrees is 2,526, with values ranging up to 8,382. Fitting
the MaxEnt model required 689 seconds, much larger than for the internet
network mostly due to the larger number of unique degrees.



General remarks on the network experiments. This fast performance can
be achieved thanks to the fact that the number of different degrees observed in
the degree distribution is typically much smaller than the size of the network
(see discussion in Sec. 3.4). The bottom graph in Fig. 5, showing the number
of Lagrange multipliers as a function of the network size, supports this. The
memory requirements remain well under control for the same reasons.

³ Available from http://www-personal.umich.edu/~mejn/netdata/.

Figure 5: Top: The computation time as a function of the number of nodes
in the network. A marker x is used for the unweighted model with
self-loops, o for the unweighted model without self-loops, V for the weighted model
with self-loops, and □ for the weighted model without self-loops. Note the
log-log scale. Middle: The number of iterations required by the Newton algorithm
before convergence. Note the log-scale on the horizontal axis. Bottom: the
number of Lagrange multipliers (i.e. the number of variables in the
Lagrange dual of the MaxEnt optimization problem) for the degree sequences
investigated, as a function of the network size. Again, note the log-scale on the
horizontal axis.

It should be pointed out that in the worst case for dense or for weighted
networks (and in particular for real-valued weights), the number of distinct
expected weighted degrees and hence the number of Lagrange multipliers can
be as large as the number of nodes n. This would make it much harder to use
off-the-shelf optimization tools for n much larger than a few thousand. However,
the problem can be made tractable again if it is acceptable to approximate the
expected weighted degrees by grouping subsets of them together into bins, and
replacing their expected degree values by a bin average. In this way the number
of Lagrange multipliers can be reduced to an acceptable level.



7.3 Using the MaxEnt model to find interesting sets of tiles

Here we aim to demonstrate the use of the MaxEnt model in formalizing
subjective interestingness where the data miner has prior information on the row
and column sums. We applied the method for finding interesting sets of tiles
from Sec. 5.2 to the three abstract datasets discussed above (KDD, ICDM, and
Pubmed, see Sec. 7.1). The particular type of prior information, on the row and
the column marginals, makes sense in this setting. Indeed, if words are jointly
contained in many documents purely because they are frequent individually, this
association is not very meaningful. Similarly, if documents share many words
purely because they are long and contain many words, this is not interesting
either. Our results below demonstrate that by assuming the frequency of words
and lengths of documents as prior information and searching for patterns that
contrast with that using our methodology, discovery of such trivial associations
is avoided.⁴

We first mined all tiles covering a number of documents equal to the
minimum support threshold mentioned in Table 2 using CHARM (Zaki and Hsiao,
2002), and subsequently ran the greedy approximation to the budgeted maximum
coverage problem to construct a sorted list of the most interesting of these
tiles. As discussed in Sec. 5.2, this sorted list has the property that each set of
k top-ranked tiles has close to the maximally achievable self-information given
its total description length.
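The greedy ranking step can be illustrated with a small sketch. It assumes that each database cell $(i, j)$ carries a weight $-\log \Pr(D(i,j) = 1)$ and that each tile has a given cost standing in for its description length; the cost values, the toy probabilities, and the function name are hypothetical and not taken from Sec. 5.2. At every step the tile with the largest newly covered weight per unit of cost is selected.

```python
import numpy as np

def greedy_tile_ranking(tiles, costs, cell_information):
    """Rank tiles greedily by newly covered information per unit of cost.

    tiles: list of (row_indices, column_indices) pairs.
    costs: hypothetical description length of each tile.
    cell_information: matrix of -log P(D(i,j) = 1) under the background model.
    """
    covered = np.zeros(cell_information.shape, dtype=bool)
    remaining = list(range(len(tiles)))
    ranking = []

    def ratio(t):
        block = np.ix_(*tiles[t])
        gain = cell_information[block][~covered[block]].sum()
        return gain / costs[t]

    while remaining:
        best = max(remaining, key=ratio)
        if ratio(best) <= 0:              # nothing new left to cover
            break
        covered[np.ix_(*tiles[best])] = True
        ranking.append(best)
        remaining.remove(best)
    return ranking

# Toy background model and candidate tiles (all values hypothetical).
p = np.array([[0.9, 0.9, 0.2, 0.2],
              [0.9, 0.9, 0.2, 0.2],
              [0.5, 0.5, 0.1, 0.1]])
tiles = [([0, 1], [0, 1]), ([0, 1], [2, 3]), ([1, 2], [2, 3])]
costs = [4.0, 4.0, 4.0]
ranking = greedy_tile_ranking(tiles, costs, -np.log(p))
print(ranking)
```

In this toy example the tile covering likely entries (probability 0.9) is ranked last although it has the same surface as the others, which is the qualitative difference with an unweighted coverage criterion.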

The top-15 tiles as returned by our method are summarized in the left column
of Table 4. (Note that the choice to show just 15 tiles is arbitrary.) We report
the sets of words sorted alphabetically, as well as the size of the set of documents.
For comparison, the results of the related method from Geerts et al. (2004) are
shown in the right column.

We argue that our method achieves the most sensible results in terms of 
non-redundancy and interestingness of the highly ranked tiles, many of them 
coinciding with major topics and concepts in data mining (ICDM, KDD) and 
data mining applied to biological problems (Pubmed). In contrast, the tile- 
based method seems to favor tiles with few but individually frequent items. 

Figure 6 shows the compression ratio for the 50 top-ranked tiles in each
of the three abstract databases. For reference, we also ran our method on a
randomized version of each of these three databases (randomized as described
in Sec. 5.1) and we show the resulting compression ratios on the same plots.
Although the compression ratio exceeds 1 only for a few top-ranked tiles on the
actual databases, it is higher than the largest compression ratio seen on the
randomized databases at least for all of the 50 top-ranked tiles for which data
is shown in these plots. This corroborates that the top-ranked tiles are indeed
meaningful and cannot be attributed to randomness.

Figure 6: The compression ratio of the top-50 ranked tiles as a function of
their rank, evaluated on the three abstract databases (full line) as well as on a
randomized version of these databases (dashed lines).

⁴ Note that we do not want to sell our method for natural language processing. It simplifies
documents to bags of words, disregarding linguistics, and can therefore not be expected to
perform similarly. Our sole purpose in applying our method to text is to show its properties,
which is hard to do on other databases typically used in this area of research.



We believe that the true assessment of a subjective measure should be
the subjective one provided just above. Still, we also conducted an objective
comparison between our interestingness measure and the tiling method from
Geerts et al. (2004). We took each of the 3 abstract databases and artificially
embedded large tiles in these databases by appending k rows ('documents') as
well as k columns ('words') to it, ensuring that the database cells in the intersec- 
tion of the k new rows and the k new columns are all equal to 1. To ensure that 
the overall statistics of the database are maintained, we furthermore randomly 
added ones in the cells in the intersection of the existing columns and the new 
rows, so as to ensure that the column densities remain the same. Similarly, we 
randomly added ones to the cells in the intersection of the existing rows and new 
columns to ensure the row frequencies are unchanged. We did this for various 
tile sizes k, namely for k = 5, 10, 15, and 20. 
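The embedding procedure can be sketched as follows. This is an illustration under an explicit assumption: new cells outside the tile are filled independently with probability equal to the density of the corresponding existing column or row, so densities are preserved in expectation. The exact randomization used in the experiments may differ, and the function name is hypothetical.

```python
import numpy as np

def embed_tile(D, k, rng):
    """Append k rows and k columns whose intersection is an all-ones k x k tile."""
    D = np.asarray(D, dtype=bool)
    m, n = D.shape
    col_density = D.mean(axis=0)          # fraction of ones per existing column
    row_density = D.mean(axis=1)          # fraction of ones per existing row
    out = np.zeros((m + k, n + k), dtype=bool)
    out[:m, :n] = D                                           # original database
    out[m:, :n] = rng.random((k, n)) < col_density            # new rows, old columns
    out[:m, n:] = rng.random((m, k)) < row_density[:, None]   # old rows, new columns
    out[m:, n:] = True                                        # the embedded tile
    return out

rng = np.random.default_rng(0)
D = rng.random((40, 25)) < 0.2            # hypothetical document-word database
E = embed_tile(D, k=5, rng=rng)
print(E.shape, E[40:, 25:].all())
```

The original database is left untouched in the upper-left block, so any tile found there before the embedding is still present afterwards.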

We then computed a ranking of tiles based on our newly proposed
compression ratio interestingness measure, as well as based on the surface of a tile
(Geerts et al., 2004). We then compared the rank of the embedded tile (or if
possible a larger one containing it) in the ranking returned by these two
interestingness measures. The results, shown in Table 5, clearly demonstrate that
our proposed approach is much more effective in finding the embedded tile.



8 Conclusions 

A significant amount of data mining research has been devoted to the assessment 
of data mining results when contrasted to prior information, leading to the 
notion of subjective measures of interestingness. A key task in this endeavour 
is the formalization of prior information. 
In this paper, we have introduced a new modeling approach for prior infor-
mation based on the maximum entropy principle. Fitting the resulting MaxEnt 
distribution boils down to a well-posed convex optimization problem. We have 
also outlined various ways in which the MaxEnt model can be used to contrast 
patterns in data with prior information, in order to come up with subjective 
interestingness measures. 

Applying this general framework to rectangular databases and prior infor- 
mation on the row and column sums, it turns out that the MaxEnt model can 
be represented particularly compactly, and specific properties can be exploited 
to dramatically enhance computational efficiency. Furthermore, we showed how 
the MaxEnt model can be used efficiently to sample random databases satis- 
fying the prior information. Finally, we also worked out the details of a new 
interestingness measure for tiles, referred to as the compression ratio, that takes 
account of row and column sums as prior information. 

In our further work we will investigate other interesting use cases of the
general framework laid out in Sec. 5. For example, in line with the alternative
randomization strategies suggested in Hanhijärvi et al. (2009), we will
investigate other types of prior information on rectangular databases, such as on the
variance within rows or columns (then the domain of the database elements can
be chosen to be $\mathcal{D} = \mathbb{R}$ and the resulting MaxEnt distribution would be a
product of Gaussian distributions), the support of certain itemsets (for which also
the ideas from Pavlov et al. (2003) and Calders (2008) will prove useful), or an
existing cluster structure in the data. Furthermore, the strategy can be applied
to other data types as well, such as categorical data (see end of Sec. 5.1). In
parallel, we will investigate the use of the MaxEnt modeling strategy for other
types of data such as relational databases.
Table 4: The left column in this table shows the sets of words (columns) J
and the number of documents |I| containing all these words for the top-15
selected tiles (I, J) by the method described in Sec. 5.2, applied to three abstract
data sets. The right column gives the results when the tiling databases method
from Geerts et al. (2004) is used.
Table 5: The ranks of the embedded tile for varying size k in the rankings
returned by our compression ratio method and the tiling method from Geerts et al.
(2004). Note that for tile size 5, no results are available for the Pubmed database
as the support threshold there is 10 such that the embedded tile is not retrieved.
Clearly, the newly proposed method is much more effective at ranking the
embedded tile highly.